Although search engine providers continually compete to expand their coverage, previous research shows that coverage differs significantly from one search engine to another [1], [2], [3] and that the combined coverage of all search engines is only a fraction of the entire Web [4]. We studied the coverage problem by comparing crawling results with monitoring results, on the assumption that a Web monitor comes closer than a crawler to collecting all the new information pages from given Web information source pages. In this paper, we compare the coverage and overlap of three well-known commercial search engines (Google, Yahoo, and MSN) on the information pages found by our Web monitor program.
We selected 260 Australian Government Web pages, including both homepages for various departments and media release pages. The Local Government Web pages include pages from both the Tasmanian State Government and municipal government services in Tasmania, which accounts for the higher number of homepages and smaller number of media release pages compared to the Federal Government. This sample set cannot test the overall performance of Web search engines, but we believe these pages are not extreme cases with respect to reachability by crawlers and frequency of content updating.
The Web monitor program, WebMon [5], was used to collect a data set from the sample Web pages. At 2-hour intervals, it revisited each Web page to get new information. The monitor identifies new information pages by comparing the old URL list (URLold) with the new URL list (URLnew) of the same monitored Web page and eliminating filtered URLs (URLfilter). For each information page, the URL, link text, and linked content are stored for further processing, and URLnew becomes URLold. We collected new Web information pages from August 2005 to October 2006. In total, 15,770 new Web pages were collected from the 260 sample Web pages. These are public Web pages that should be readily accessible to any Web crawler.
To check coverage by search engines, we do not simply retrieve the URL, as the content may have changed. Rather, we submit a query with the link text of the collected Web page and then check whether the page is included among those retrieved. We considered the top 100 search results because 95.5% of positive results fall within the top 100 (95% confidence level, 5% confidence interval). Random sampling from the entire data set was necessary in this evaluation because search engines constrain or monitor the number of automated searches from the same user or IP address. For each month, we drew a random sample sized for a 95% confidence level and a 5% confidence interval. In total, 4,203 samples were selected, 23% of all monitoring results.
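The text does not say how the monthly sample sizes were derived. One common way to size a sample for a 95% confidence level and a 5% confidence interval is Cochran's formula with a finite population correction. The sketch below assumes that method, a worst-case proportion of 0.5, and a hypothetical monthly population of 1,000 pages; the function name `sample_size` is illustrative and not taken from the study.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for a finite population (Cochran's formula, assumed method).

    population: number of new pages collected in the month
    z:          z-score for the confidence level (1.96 for 95%)
    margin:     confidence interval as a fraction (0.05 for 5%)
    p:          assumed proportion; 0.5 gives the largest (safest) sample
    """
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks the sample for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Hypothetical month with 1,000 newly monitored pages.
monthly_sample = sample_size(1000)
print(monthly_sample)
```

Under these assumptions, a month with about 1,000 new pages needs a sample of a little under 300 pages, so the sample is a much larger share of the data than it would be for a very large population.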
Domains | Web Sites | Monitored Pages |
---|---|---|
Federal Government Homepages | 14(5%) | 1,125(7%) |
Federal Government Media pages | 118(45%) | 8,825(56%) |
Local Government Homepages | 111(43%) | 2,660(17%) |
Local Government Media pages | 17(7%) | 3,173(20%) |
Total | 260 | 15,770 |
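
The new-page detection step used to build the monitored data set (compare URLold with URLnew, then drop URLs in URLfilter) amounts to a set difference. A minimal sketch, assuming URLs are compared as exact strings; the function name and the example URLs are hypothetical and are not taken from WebMon [5].

```python
def find_new_urls(url_old, url_new, url_filter):
    """Return URLs in url_new that are neither in url_old nor filtered.

    The result keeps the order in which the links appear on the page.
    """
    known = set(url_old) | set(url_filter)
    fresh = []
    for url in url_new:
        if url not in known:
            fresh.append(url)
            known.add(url)  # also drops duplicate links within url_new
    return fresh


# Hypothetical link lists from two consecutive visits to one monitored page.
url_old = [
    "http://example.org/media/1.html",
    "http://example.org/media/2.html",
]
url_new = [
    "http://example.org/media/3.html",
    "http://example.org/media/2.html",
    "http://example.org/contact.html",
]
url_filter = ["http://example.org/contact.html"]

new_pages = find_new_urls(url_old, url_new, url_filter)
print(new_pages)

# After each visit, URLnew becomes URLold for the next comparison.
url_old = url_new
```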
The overall coverage results for the three search engines are summarized in Table 2. Coverage is measured as the positive result ratio, that is, the proportion of sampled pages that a search engine returned. Google gives the highest overall return and MSN the lowest: overall, Google returns 54% of the monitored pages and MSN 23%. That is, the search engines miss between 46% and 77% of the Web information pages that have been posted. The search engines also perform differently across different areas. For Google, local government media release pages give the best results, while local government homepages give the worst. In contrast, for both MSN and Yahoo, local government media release pages give the worst results.
Domains | Sampled Pages | Google | Yahoo | MSN |
---|---|---|---|---|
Federal Government Homepages | 289 | 153(53%) | 87(30%) | 106(37%) |
Federal Government Media pages | 2,328 | 1,316(57%) | 930(40%) | 700(30%) |
Local Government Homepages | 724 | 258(36%) | 135(19%) | 115(16%) |
Local Government Media pages | 862 | 544(63%) | 102(12%) | 32(4%) |
Total | 4,203 | 2,271(54%) | 1,254(30%) | 953(23%) |
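
The positive result ratios can be recomputed from the counts in Table 2. The sketch below divides each engine's positive results by the number of sampled pages; the dictionary layout and function names are illustrative, and the engine order (Google, Yahoo, MSN) follows the totals quoted in the text.

```python
# Sampled pages per domain (Table 2).
SAMPLED = {
    "Federal Government Homepages": 289,
    "Federal Government Media pages": 2328,
    "Local Government Homepages": 724,
    "Local Government Media pages": 862,
}

# Positive results per engine, in the same domain order as SAMPLED.
POSITIVE = {
    "Google": [153, 1316, 258, 544],
    "Yahoo": [87, 930, 135, 102],
    "MSN": [106, 700, 115, 32],
}


def coverage_by_domain(engine):
    """Positive result ratio of one engine for each domain."""
    return {
        domain: hits / SAMPLED[domain]
        for domain, hits in zip(SAMPLED, POSITIVE[engine])
    }


def overall_coverage(engine):
    """Positive result ratio of one engine over all sampled pages."""
    return sum(POSITIVE[engine]) / sum(SAMPLED.values())


for engine in POSITIVE:
    print(f"{engine}: {overall_coverage(engine):.1%}")
```

The overall ratios come out at roughly 54% for Google, 30% for Yahoo, and 23% for MSN, in line with the coverage figures quoted in the text.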
Figure 1 illustrates coverage trends during the monitoring period. The month-by-month results show that Google is consistently the best, with Yahoo second (except for an anomalous period at the end) and MSN third. Google and MSN broadly give higher returns in more recent months. This might reflect improved crawling during the period, but it is more likely that they use the crawl date or index date as one of their result ranking factors. Yahoo does not improve over time, but the sudden change at the end suggests possible changes to the way it crawls the Web.
The total unique positive returns (TUPR) number 2,665, which is 63.4% of the 4,203 sampled Web pages. TUPR is calculated as follows:
TUPR = G(2,271) + Y(1,254) + M(953) - GM(782) - GY(974) - MY(490) + GMY(433), where G, Y, and M represent the positive results from Google, Yahoo, and MSN, and GM, GY, MY, and GMY represent their overlapping positive returns (see Figure 1 bottom).
The overlap ratio across all three search engines is 16.2% (433/2,665), and the overlap ratios between pairs of search engines are as follows:
This result means that Google dominates the other search engines: 78% (974/1,254) of Yahoo's positive results and 82% (782/953) of MSN's positive results are also returned by Google. It therefore does not suggest that using a meta-search engine would bring a significant improvement.
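
The TUPR figure and the dominance percentages follow from the inclusion-exclusion principle applied to the counts given above. The short sketch below recomputes them; the variable names are illustrative.

```python
# Positive results per engine and their overlaps, as reported in the text.
G, Y, M = 2271, 1254, 953      # Google, Yahoo, MSN
GM, GY, MY = 782, 974, 490     # pairwise overlaps
GMY = 433                      # returned by all three engines
SAMPLED_PAGES = 4203

# Inclusion-exclusion: add the single sets, subtract pairwise overlaps,
# add back the triple overlap that was subtracted once too often.
tupr = G + Y + M - GM - GY - MY + GMY

combined_coverage = tupr / SAMPLED_PAGES   # share of sampled pages found by any engine
overall_overlap = GMY / tupr               # share of unique returns found by all three
yahoo_in_google = GY / Y                   # share of Yahoo's results also in Google
msn_in_google = GM / M                     # share of MSN's results also in Google
google_share = G / tupr                    # share of unique returns covered by Google

print(tupr)
print(f"{combined_coverage:.1%} {overall_overlap:.1%}")
print(f"{yahoo_in_google:.0%} {msn_in_google:.0%} {google_share:.0%}")
```

Because Google alone accounts for most of the unique returns, adding Yahoo and MSN raises combined coverage only from about 54% to about 63% of the sampled pages.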
In this paper we studied the coverage and overlap of three commercial search engines (Google, Yahoo, and MSN) using 15,770 Web information pages collected from 260 Australian federal and local government Web pages over 15 months. We found that (1) the combined coverage of all three commercial search engines is 63.4%, while individual coverage varies from 22.7% to 54.0%; (2) the overall overlap is 16.2%, which is large compared to other studies [1], [3]; and (3) one search engine (Google) dominates the others, covering 85% of all unique search returns. Coverage needs to be enhanced, either by employing dynamic scheduling strategies or by using other Web information technologies such as Web monitoring. The value of meta-search also needs to be reconsidered, because our results, especially (2) and (3), weaken the assumptions underlying meta-search research: that each search engine has low coverage and that no single search engine dominates.
This work is supported by the Asian Office of Aerospace Research and Development (AOARD) (AOARD-06-4006).
[1] A study of results overlap and uniqueness among major web search engines, Information Processing and Management, 42 (5) 2006: p. 1379 - 1391
[2] A Comparative Study of Web Search Service Performance, Annual Conference of the American Society for Information Science, 1996
[3] A technique for measuring the relative size and overlap of public Web search engines, WWW7: The Seventh International World Wide Web Conference, 1998. Brisbane, Australia
[4] Searching the World Wide Web, Science, 280, 1998
[5] Web Information Management System: Personalization and Generalization, IADIS International Conference WWW/Internet 2003