Data Mining and the Web: Past, Present and Future
The World Wide Web is rapidly emerging as an important medium for transacting commerce as well as for disseminating information on a wide range of topics (e.g., business, government, recreation). According to most predictions, the majority of human information will be available on the Web within ten years. These huge amounts of data raise a grand challenge: how to turn the Web into a more useful information utility.

Crawlers, search engines and Web directories like Yahoo! constitute the state-of-the-art tools for information retrieval on the Web today. Crawlers for the major search engines retrieve Web pages, on which full-text indexes are then constructed. A user query is simply a list of keywords (with some additional operators), and the query response is a list of pages ranked by their similarity to the query.

Today's search tools, however, are plagued by four problems: (1) the abundance problem, that is, the phenomenon of hundreds of irrelevant documents being returned in response to a search query; (2) limited coverage of the Web; (3) a limited query interface based on syntactic, keyword-oriented search; and (4) limited customization to individual users.

These problems, in turn, can be attributed to the following characteristics of the Web. First and foremost, the Web is a huge, diverse and dynamic collection of interlinked hypertext documents. There are about 300 million pages on the Web today, with about 1 million being added daily. Furthermore, it is widely believed that 99% of the information on the Web is of no interest to 99% of the people. Second, except for hyperlinks, the Web is largely unstructured. Finally, most information on the Web is in the form of HTML documents, for which analysis and extraction of content is very difficult.
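To make the keyword-oriented search model described above concrete, the following minimal sketch ranks pages by the cosine similarity between term-frequency vectors, the basic idea behind full-text index ranking. The page texts, identifiers and function names are invented for illustration and are not from this paper; real search engines use far more elaborate weighting.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector for a whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(pages, query):
    """Return page ids ordered by decreasing similarity to the query."""
    q = vectorize(query)
    scored = [(cosine(vectorize(text), q), pid) for pid, text in pages.items()]
    return [pid for score, pid in sorted(scored, reverse=True) if score > 0]

pages = {
    "p1": "data mining algorithms for large databases",
    "p2": "government recreation travel pages",
    "p3": "mining association rules in web data",
}
print(rank(pages, "data mining"))  # both mining pages rank above p2
```

Note that this purely syntactic matching is exactly what gives rise to the abundance problem: any page containing the query words scores well, relevant or not.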
Furthermore, the contents of many Internet sources are hidden behind search interfaces and thus cannot be indexed: HTML documents are dynamically generated by these sources, in response to queries, using data stored in commercial DBMSs. The question therefore is: how can we overcome these and other challenges that impede the Web resource discovery process? Fortunately, new and sophisticated techniques that have been developed in the area of data mining (also known as knowledge discovery) can aid in the extraction of useful information from the Web. Data mining algorithms have been shown to scale well to large data sets and have been successfully applied in several areas such as medical diagnosis, weather prediction, credit approval, customer segmentation, marketing and fraud detection.

In this paper, we begin by reviewing popular data mining techniques such as association rules, classification, clustering and outlier detection. We provide a brief description of each technique as well as efficient algorithms for implementing it. We then discuss algorithms for discovering Web, hypertext and hyperlink structure that have been proposed by researchers in recent years. The key difference between these algorithms and earlier data mining algorithms is that the former take hyperlink information into account. Finally, we conclude by listing research issues that remain to be addressed in the area of Web mining.

2 Data Mining Techniques

In this section, we briefly describe key data mining algorithms that have been developed for large databases. A number of these algorithms are also applicable in the Web context and can be used to find related Web pages, as well as to cluster and categorize them.
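As a toy illustration of clustering pages in the Web context, the following greedy single-pass sketch groups pages whose token sets exceed a Jaccard-similarity threshold. This is an invented example, not one of the algorithms surveyed here; the page texts, function names and threshold value are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(pages, threshold=0.2):
    """Greedy single-pass clustering: assign each page to the first
    cluster whose representative token set is similar enough,
    otherwise start a new cluster."""
    clusters = []  # list of (representative token set, [page ids])
    for pid, text in pages.items():
        tokens = set(text.lower().split())
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((tokens, [pid]))
    return [members for rep, members in clusters]

pages = {
    "p1": "web data mining",
    "p2": "mining web data pages",
    "p3": "cooking recipes pasta",
}
print(cluster(pages))  # the two mining pages group together
```

A single pass keeps the cost linear in the number of pages, which matters at Web scale, at the price of sensitivity to input order.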