Track: Search
Paper Title:
An Adaptive Crawler for Locating Hidden-Web Entry Points
Authors:
Abstract:
In this paper we describe new adaptive crawling strategies
to efficiently locate the entry points to hidden-Web sources.
The fact that hidden-Web sources are very sparsely distributed
makes the problem of locating them especially challenging.
We deal with this problem by using the contents of
pages to focus the crawl on a topic; by prioritizing promising
links within the topic; and by also following links that may
not lead to immediate benefit. We propose a new framework
whereby crawlers automatically learn patterns of promising
links, and adapt their focus as the crawl progresses, thus
greatly reducing the amount of required manual setup and
tuning. Our experiments over real Web pages in a representative
set of domains indicate that online learning leads
to significant gains in harvest rates: the adaptive crawlers
retrieve up to three times as many forms as crawlers that
use a fixed focus strategy.
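
The idea of learning link patterns online can be illustrated with a minimal sketch. It is not the authors' implementation: the class `AdaptiveLinkScorer`, the function `crawl`, the token-frequency scoring rule, and the in-memory `web` dictionary are all assumptions made for illustration. The sketch rewards only links that lead directly to a form, so it does not model links with delayed benefit, and it assumes harvest rate means forms retrieved divided by pages visited.

```python
import heapq
from collections import defaultdict


class AdaptiveLinkScorer:
    """Hypothetical scorer: estimates how promising a link is from its tokens."""

    def __init__(self):
        self.hits = defaultdict(int)  # times a token's link led to a form page
        self.seen = defaultdict(int)  # times a token's link was followed

    def score(self, tokens):
        if not tokens:
            return 0.5  # neutral prior for links with no features
        # Laplace-smoothed success fraction, averaged over the link's tokens.
        return sum((self.hits[t] + 1) / (self.seen[t] + 2) for t in tokens) / len(tokens)

    def update(self, tokens, found_form):
        for t in tokens:
            self.seen[t] += 1
            if found_form:
                self.hits[t] += 1


def crawl(web, seed, budget):
    """Best-first crawl over a toy graph: url -> (has_form, [(target, tokens), ...])."""
    scorer = AdaptiveLinkScorer()
    frontier = [(-0.5, 0, seed, ())]  # (negated score, tie-breaker, url, link tokens)
    visited, forms, counter = set(), [], 1
    while frontier and len(visited) < budget:
        _, _, url, tokens = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        has_form, links = web[url]
        scorer.update(tokens, has_form)  # learn from the link that led here
        if has_form:
            forms.append(url)
        for target, link_tokens in links:
            if target not in visited:
                # Scores are fixed at push time; queued links are not re-scored.
                entry = (-scorer.score(link_tokens), counter, target, tuple(link_tokens))
                heapq.heappush(frontier, entry)
                counter += 1
    harvest_rate = len(forms) / len(visited) if visited else 0.0
    return forms, harvest_rate
```

A fixed-focus crawler would correspond to never calling `update`, so every link keeps its initial score; the adaptive variant shifts the frontier ordering as evidence accumulates.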