Track: Search
Paper Title:
Robust Methodologies for Modeling Web Click Distributions
Authors:
Abstract:
Metrics such as click counts are vital to online businesses,
but their measurement has been problematic due to the inclusion of
high-variance robot traffic.
We posit that, by applying statistical methods more rigorous than those
employed to date, we can build a robust model of the
distribution of clicks and then set probabilistically sound
thresholds to address outliers and robots.
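
As a concrete illustration of one way such a threshold could be set, the
following is a minimal sketch, assuming a fitted model's complementary CDF
is available; the function and parameter names are hypothetical and do not
describe our production system.

    # Minimal sketch: choose the smallest click count whose tail
    # probability under the fitted model falls below a significance
    # level alpha. `ccdf` is a placeholder for a fitted model's
    # complementary CDF, P(X > k).
    def threshold_from_ccdf(ccdf, alpha=1e-4, max_clicks=10**6):
        """Return the smallest k with P(X > k) < alpha."""
        for k in range(1, max_clicks):
            if ccdf(k) < alpha:
                return k
        return max_clicks

    # Example usage with a toy geometric-tail model:
    # threshold_from_ccdf(lambda k: 0.5 ** k)  -> 14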
Prior research in this domain has used inappropriate
statistical methodology to model these distributions, and current industrial
practice eschews this research in favor of conservative, ad hoc click-level thresholds.
The prevailing belief is that such distributions are scale-free power
laws, but using more rigorous statistical methods we find that the
best description of the data is instead provided by a
scale-sensitive Zipf-Mandelbrot mixture distribution.
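
For reference, the Zipf-Mandelbrot distribution assigns probability
proportional to 1/(k+q)^s to rank k, and a mixture combines several such
components. The sketch below is an illustrative truncated implementation,
not our fitting code; the parameter names are ours.

    import numpy as np

    # Truncated Zipf-Mandelbrot PMF: P(k) proportional to 1/(k+q)^s,
    # normalized over ranks 1..k_max. Unlike a pure power law, the
    # offset q makes the distribution scale-sensitive.
    def zipf_mandelbrot_pmf(k, s, q, k_max=10**5):
        ranks = np.arange(1, k_max + 1)
        norm = np.sum(1.0 / (ranks + q) ** s)
        return (1.0 / (k + q) ** s) / norm

    # Two-component mixture with weight w on the first component,
    # e.g. one component per hypothesized class of users.
    def mixture_pmf(k, w, s1, q1, s2, q2):
        return (w * zipf_mandelbrot_pmf(k, s1, q1)
                + (1 - w) * zipf_mandelbrot_pmf(k, s2, q2))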
Our results are based on ten datasets from various verticals in the
Yahoo domain.
Since mixture models can overfit the data,
we take care to select models using the Bayesian Information
Criterion (BIC), a log-likelihood-based criterion
that penalizes overly complex models.
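
A minimal sketch of the criterion follows, under the standard definition
BIC = p ln(n) - 2 ln(L), with p free parameters, n samples, and L the
maximized likelihood; lower is better, and the p ln(n) term is what
penalizes extra mixture components.

    import numpy as np

    # BIC for a fitted model: p * ln(n) - 2 * ln(L). Among candidate
    # models (e.g. mixtures with different numbers of components),
    # the one with the lowest BIC is preferred.
    def bic(log_likelihood, n_params, n_samples):
        return n_params * np.log(n_samples) - 2.0 * log_likelihood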
Using a mixture model in the web activity domain makes sense because
there are likely multiple classes of users.
In particular, we have
observed a sizable set of ``users'' that visit the
Yahoo portal exactly once a day. We surmise these may be robots
testing internet connectivity by pinging the Yahoo main website.
Backing up our quantitative analysis is a graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots (a sketch of such a plot appears below). This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers.

We plan to use the results of this work in applications for robot removal in web-metrics business intelligence systems.
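
The following is a minimal sketch of such a log-log cumulative plot,
assuming an array of per-user click counts and a fitted model's
complementary CDF; the names here are illustrative placeholders rather
than our actual tooling.

    import numpy as np
    import matplotlib.pyplot as plt

    # Empirical complementary CDF of per-user click counts, overlaid
    # with a theoretical CCDF, in log-log space. Cumulating the data
    # smooths noisy tail counts, making the plot robust to outliers.
    def plot_ccdf(clicks, model_ccdf):
        xs = np.sort(np.asarray(clicks))
        emp = 1.0 - np.arange(len(xs)) / len(xs)  # P(X >= x)
        plt.loglog(xs, emp, '.', label='empirical')
        ks = np.unique(xs)
        plt.loglog(ks, [model_ccdf(k) for k in ks], '-', label='model')
        plt.xlabel('clicks per user')
        plt.ylabel('P(X >= x)')
        plt.legend()
        plt.show()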