Track: Search
Paper Title:
Extraction and Classification of Dense Communities in the Web
Authors:
Abstract:
The World Wide Web (WWW) is rapidly becoming important for society
as a medium for sharing data, information and services, and
there is a growing interest in tools for understanding collective
behaviors and emerging phenomena in the WWW. In this paper we
focus on the problem of searching and classifying
communities in the web. Loosely speaking a community is a group
of pages related to a common interest. More formally communities
have been associated in the computer science literature with the
existence of a locally dense sub-graph of the web-graph (where web
pages are nodes and hyper-links are arcs of the web-graph). The
core of our contribution is a new scalable algorithm for finding
relatively dense subgraphs in massive graphs. We apply our
algorithm on web-graphs built on three publicly available large
crawls of the web (with raw sizes up to 120M nodes and 1G arcs).
The effectiveness of our algorithm in finding dense subgraphs is
demonstrated experimentally by embedding artificial communities in
the web-graph and counting how many of these are blindly found.
Effectiveness increases with the size and density of the
communities: it is close to 100% for communities of a thirty
nodes or more (even at low density). It is still about 80% even
for communities of twenty nodes with density over 50% of the
arcs present. At the lower extremes the algorithm catches 35% of
dense communities made of ten nodes. We complete our
Community Watch system by clustering the communities found in the
web-graph into homogeneous groups by topic and labelling each
group by representative keywords.