Track: XML
Paper Title:
Adaptive Record Extraction From Web Pages
Authors:
Abstract:
We describe an adaptive method for extracting records from
web pages. Our algorithm combines a weighted tree-matching
metric with clustering to obtain data extraction patterns.
We compare our method experimentally to the state of the art
and show that our approach is competitive
for rigidly-structured records (such as product descriptions)
and far superior for loosely-structured records (such as entries
on blogs).
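
The abstract does not specify the metric or the clustering procedure, so the following is only an illustrative sketch of how a weighted tree-matching score could be combined with clustering. It assumes a depth-decayed variant of the classic simple tree matching algorithm and a greedy threshold clustering; the names Node, weighted_match, similarity and cluster, the decay factor, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DOM-like tree node identified only by its tag name."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def weighted_match(a, b, depth=0, decay=0.5):
    """Weighted simple tree matching (assumed weighting scheme).

    A matched pair of nodes at depth d contributes decay**d, so
    agreement near the root counts more than agreement in the leaves.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best order-preserving match of the first i children
    # of a against the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = weighted_match(a.children[i - 1], b.children[j - 1],
                                  depth + 1, decay)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + pair)
    return decay ** depth + dp[m][n]


def weight(tree, depth=0, decay=0.5):
    """Total weight of a tree, i.e. its match score against itself."""
    return decay ** depth + sum(weight(c, depth + 1, decay)
                                for c in tree.children)


def similarity(a, b, decay=0.5):
    """Normalized similarity in [0, 1]; 1.0 for identical trees."""
    total = weight(a, decay=decay) + weight(b, decay=decay)
    return 2 * weighted_match(a, b, decay=decay) / total


def cluster(trees, threshold=0.7, decay=0.5):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, else start a new cluster."""
    clusters = []
    for tree in trees:
        for members in clusters:
            if similarity(members[0], tree, decay) >= threshold:
                members.append(tree)
                break
        else:
            clusters.append([tree])
    return clusters


# Two product-like subtrees and one blog-like subtree.
product_a = Node("div", [Node("h2"), Node("span"), Node("img")])
product_b = Node("div", [Node("h2"), Node("span"), Node("img")])
blog_post = Node("div", [Node("p"), Node("p")])
groups = cluster([product_a, product_b, blog_post])
```

In this sketch each resulting cluster stands in for one candidate extraction pattern; a real system would still need to generalize the cluster members into a pattern, which is not shown here.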