Yang Sun, Ziming Zhuang, and C. Lee Giles
The Pennsylvania State University
University Park, PA, USA
General Terms: Experimentation, Measurement.
Keywords: crawler, robots exclusion protocol, robots.txt, search engine.
Without robots, there would probably be no search engines. Web search engines, digital libraries, and many other web applications such as offline browsers, internet marketing software, and intelligent searching agents depend heavily on robots to acquire documents. Robots, also called ``spiders", ``crawlers", or ``bots", are self-acting agents that navigate around the clock through the hyperlinks of the Web, harvesting topical resources at almost no cost of human management [4]. Because of the highly automated nature of robots, rules must be made to regulate such crawling activities in order to prevent undesired impact on server workload or access to non-public information.
The Robots Exclusion Protocol [3] was proposed to provide advisory regulations for robots to follow. A file called robots.txt, which contains robot access policies, is deployed at the root directory of a website and is accessible to all robots. Ethical robots read this file and obey its rules when visiting the website. Despite the importance of the robots.txt convention for both content providers and harvesters, little work has been done to investigate its usage in detail, especially at the scale of the Web. A study of robots.txt usage in UK universities and colleges examined 163 websites and 53 robots.txt files [2], analyzing the files in terms of file size and the use of the Robots Exclusion Protocol within the UK university domains. Drott [1] studied the usage of robots.txt as an aid for indexing and for protecting information on a sample of 60 Fortune Global 500 company websites.
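For illustration, the following is a minimal sketch of how an ethical robot could interpret such a policy, using Python's standard urllib.robotparser module; the rules, robot name, and URLs shown are hypothetical, not taken from our dataset.

    from urllib import robotparser

    # Hypothetical policy: block all robots from /private/, as a robots.txt
    # file deployed at the root of a website might do.
    rules = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # An ethical robot checks the policy before fetching a URL.
    print(rp.can_fetch("MyCrawler", "http://www.example.com/private/a.html"))  # False
    print(rp.can_fetch("MyCrawler", "http://www.example.com/public/b.html"))   # True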
In this poster, we present the first large-scale study of robots.txt files, covering the education, government, news, and business domains. Our observations are based on considerably larger-scale data than previous studies.
Our primary source for collecting the initial URLs to feed our crawler is the Open Directory Project (DMOZ). Our collection from DMOZ covers three domains: education, news, and government. The education domain is further broken down into the American, European, and Asian university domains. We use the Fortune Top 1000 Company List as our data source for the business domain. Our crawler performed five crawls of the same set of websites between Dec. 2005 and Oct. 2006.
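As a rough sketch (not the crawler actually used in this study), robots.txt files can be collected by requesting the file from each site's root directory; the seed URLs and timeout value below are hypothetical.

    import urllib.request
    from urllib.parse import urljoin

    # Hypothetical seed sites; the actual study drew its site lists from
    # DMOZ and the Fortune Top 1000 Company List.
    sites = ["http://www.example.edu/", "http://www.example.gov/"]

    for site in sites:
        url = urljoin(site, "/robots.txt")
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                content = resp.read().decode("utf-8", errors="replace")
                print(url, "->", len(content), "bytes")
        except Exception:
            # Sites without a robots.txt typically return 404 (or time out).
            print(url, "-> no robots.txt retrieved")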
Statistics: We crawled and investigated 7,593 unique websites, including 600 government websites, 2,047 newspaper websites, 1,487 USA university websites, 1,420 European university websites, 1,039 Asian university websites, and 1,000 company websites.
Overall, the percentage of websites that have robots.txt increased from 35% to 38.5% over the past 11 months (see Figure 1). This result is expected, since search engines and intelligent searching agents are becoming more important for accessing web information. The Robots Exclusion Protocol is adopted more frequently by government (44%), newspaper (46%), and USA university (45.9%) websites. These websites use it extensively to protect information not intended for the public and to balance server workload.
There are 1,056 named robots found in our dataset. The universal robot ``*" is the most frequently used name in the User-Agent field, appearing 2,744 times, which means that 93.8% of robots.txt files have rules for the universal robot. 72.4% of the named robots appeared only once or twice.
Size and Length: An interesting observation is that the sizes and lengths of the robots.txt files on government websites are significantly larger than those from the other investigated domains. There are 26 files with a length of 68 lines and 4 with 253 lines. A reasonable explanation is that government websites tend to adopt more sophisticated robot restrictions, which results in larger and longer robots.txt files (see Table 1).
Crawl Delay: The field name ``Crawl-Delay" has recently come into use in robots.txt files. Web server administrators most likely use this field to keep the crawling workload affordable. The usage of Crawl-Delay increased from 40 cases (1.5%) in Dec. 2005 to 140 cases (4.8%) in Oct. 2006. The frequencies of Crawl-Delay rules for different robots are shown in Table 2.
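As a hypothetical example, a robots.txt entry using this non-standard field might look like the following, asking a particular robot to wait ten seconds between requests (the robot name, delay value, and path are illustrative):

    User-agent: Slurp
    Crawl-Delay: 10
    Disallow: /cgi-bin/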
Incorrect Use: When we examined the content of the collected robots.txt files, a significant number of incorrect uses of the Robots Exclusion Protocol were found. These incorrect uses include misnamed files, incorrect locations, and conflicting rules. Because of these incorrect uses, the access policies will be ignored by robots. We observe 13 cases of misnamed robots.txt files and find 23 files in which a general name such as ``crawler", ``robot", or ``webcrawlers" appears in the User-Agent field. A general name in the User-Agent field is an incorrect use of the Robots Exclusion Protocol, since robots match this field against their own specific names. We also found 282 robots.txt files with ambiguous rules and 18 files with conflicting rules (e.g., a directory is disallowed first and then allowed, or allowed first and then disallowed; see the example below).
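A hypothetical illustration of such a conflict, in which the same directory is first disallowed and then allowed for the same robot, leaving the policy ambiguous:

    User-agent: *
    Disallow: /images/
    Allow: /images/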
The actual method by which robots access robots.txt is not specified in the Robots Exclusion Protocol. Open source crawlers such as ``Websphinx", ``Jspider", and ``Nutch" check the robots.txt file right before crawling each URL by default. We observe that ``Googlebot", ``Yahoo! Slurp", and ``MSNbot" cache the robots.txt file of a website. If the robots.txt file is modified while cached, these robots might disobey the new rules. As a result, a few disallowed links appear in these search engines.
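A minimal sketch of such a caching strategy, assuming an in-memory cache keyed by host; the function and variable names are our own and do not describe the internals of any particular crawler.

    from urllib import robotparser
    from urllib.parse import urlsplit, urlunsplit

    _cache = {}  # host -> RobotFileParser

    def allowed(agent, url):
        """Check a URL against a cached copy of the host's robots.txt."""
        host = urlsplit(url).netloc
        if host not in _cache:
            rp = robotparser.RobotFileParser()
            rp.set_url(urlunsplit(("http", host, "/robots.txt", "", "")))
            rp.read()  # fetched once; later edits to robots.txt are missed
            _cache[host] = rp
        return _cache[host].can_fetch(agent, url)

    # Example: the first call fetches robots.txt, later calls reuse the cache.
    # print(allowed("MyCrawler", "http://www.example.com/private/page.html"))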
Comments in Robots.txt: We have found cases in which comments in robots.txt files are written not for versioning or explanation but as messages to users or robot administrators. Even a blog has been found in a robots.txt file.
Issues: The rule ``Disallow: " with an empty value can be interpreted as matching everything or nothing: does it mean that robots are allowed to crawl everything, or nothing? Another issue concerns the ``Crawl-Delay" field. Since this field is not in the original Robots Exclusion Protocol, not every robot recognizes it. How should webmasters write robots.txt if they do not know whether a robot recognizes the rule? The Robots Exclusion Protocol offers no further discussion of these issues.
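For reference, a small sketch showing how one widely used parser, Python's urllib.robotparser, happens to resolve the empty ``Disallow: " rule (it treats it as allowing everything); this reflects one parser's behavior, not a clarification of the protocol itself.

    from urllib import robotparser

    # Hypothetical robots.txt with an empty Disallow value.
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow:"])

    # urllib.robotparser interprets the empty rule as "allow everything".
    print(rp.can_fetch("MyCrawler", "http://www.example.com/any/page.html"))  # True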