Track: Search
Paper Title:
Web Page Classification with Heterogeneous Data Fusion
Authors:
Abstract:
Web pages are more than text: they contain rich contextual and
structural information, e.g., the title, metadata, and anchor text,
each of which can be seen as a data source or a
representation. Because these heterogeneous data sources differ in
dimensionality and in representational form, simply putting them
together does not greatly improve classification performance. We
observe that, via a kernel function, data sources of different
dimensions and types can be represented in the common format of a
kernel matrix, which can be seen as a generalized similarity measure
between web pages. Accordingly, a kernel
learning approach is employed to fuse these heterogeneous data
sources. Experimental results on a collection from the Open Directory
Project (ODP) database validate the advantages of the proposed method
over any single data source and over the uniformly weighted
combination of the heterogeneous data sources.
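
The fusion idea can be illustrated with a minimal sketch. It is not the
method evaluated in the paper: the abstract does not specify the kernel
learning formulation, so the sketch substitutes a simple kernel-target
alignment heuristic for the learned weights. The source names, the toy
feature matrices, and the helper functions (linear_kernel,
normalize_kernel, alignment, fuse) are all hypothetical.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix K[i, j] = <x_i, x_j> for one data source (rows are pages)."""
    return X @ X.T

def normalize_kernel(K):
    """Cosine-normalize K so each diagonal entry is 1, making sources comparable."""
    d = np.sqrt(np.diag(K))
    d[d == 0] = 1.0  # guard against pages with an all-zero feature vector
    return K / np.outer(d, d)

def alignment(K, y):
    """Kernel-target alignment of K with the ideal kernel y y^T (labels in {-1, +1})."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def fuse(kernels, weights):
    """Convex combination of kernel matrices; PSD is preserved for weights >= 0."""
    w = np.asarray(weights, dtype=float)
    if w.sum() == 0:  # fall back to uniform weights if every score is zero
        w = np.ones_like(w)
    w = w / w.sum()
    return sum(wk * K for wk, K in zip(w, kernels))

# Toy data (assumption): 6 pages, three sources of different dimensionality.
rng = np.random.default_rng(0)
y = np.array([1, 1, 1, -1, -1, -1])
sources = {
    "title": rng.normal(size=(6, 5)) + 0.5 * y[:, None],
    "metadata": rng.normal(size=(6, 3)),
    "anchor_text": rng.normal(size=(6, 8)) + y[:, None],
}

# Every source, whatever its dimensionality, becomes a 6 x 6 kernel matrix.
kernels = [normalize_kernel(linear_kernel(X)) for X in sources.values()]

# Baseline: uniformly weighted combination.
uniform = fuse(kernels, np.ones(len(kernels)))

# Stand-in for kernel learning: weight each source by its clipped alignment.
scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
learned = fuse(kernels, scores)
```

Either fused matrix can be passed to any classifier that accepts a
precomputed kernel. The point of the sketch is only that sources with 5,
3, and 8 features all reduce to matrices of the same shape, so they can
be combined with per-source weights rather than concatenated.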