The Web is a constantly updating source of information.
New developments, events and
commentaries are constantly being posted on the Web
in the form of blogs, news articles, Usenet posts, etc.
In order to help the users
avoid the tedious task of periodically searching for
updated information on the Web,
some of the search
engines provide an ``alert'' service
(e.g. Google alerts [16] or Live alerts [26]).
The main idea is to let the users create profiles (e.g. by
specifying a few query keywords) describing the information
for which they would like to receive updates.
After that, the search engines are continuously
monitoring
the newly collected information from the Web and will
alert the user (e.g. through email) whenever
there is new information that matches the user's profile.
In this way, the user can stay current with a
developing news story, track a good deal on a desired
product or follow who is writing about her blog.
The continuous monitoring of the Web to match a given
set of user-defined keywords is also known
as continuous querying.
Typically, the results of a continuous query can be
returned to the user either on an ``as-found'' basis
or ``in-batch'' after a particular time interval (e.g. once a day).
Although running continuous queries on the Web
can potentially help users stay current with important
updates, in general the amount of information returned as updates
to the user can be ``unbounded''.
For example, if the user is following a very controversial
or popular topic, she may receive hundreds of updated pages as an alert,
and may thus be overwhelmed by this huge amount of information.
Since a user typically can allow a limited time for
comprehending the information delivered to her, one
way to alleviate this problem is to allow the user to
restrict (or bound)
the number of returned results within a particular time
interval.
More specifically,
the user may decide that, say, every day, she is interested
in reading only the 10 most relevant updates to her continuous
query and would like to receive only those updates.
For the cases where it is acceptable to return the
results in-batch the solution is straightforward:
we first collect all the relevant results within a day, we rank them
and then return the top-10 to the user.
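The in-batch case just described can be sketched in a few lines of Python (a toy illustration; all names and the sample data are ours):

```python
import heapq

def batch_top_k(results, k=10):
    """Collect all (relevance, doc) pairs seen during the period,
    then return the k most relevant ones in ranked order."""
    return heapq.nlargest(k, results, key=lambda r: r[0])

# toy stream of (relevance, document-id) pairs collected over one day
day = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")]
best = batch_top_k(day, k=2)
```

Since all results of the period are available before ranking, this strategy is exact, but every result is delayed until the end of the period.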
However, in the cases where the ``freshness'' of the
results is very important, and thus we need to return
them as-found, the user is not willing to wait until
we collect all the relevant results and return them at the end of the query period.
For example, if a user is tracking Web pages describing digital cameras
offered for sale, she would like to know the
10 best pages according to some specification as soon as they appear, since the
cameras may be sold after a short period of time.
Returning the best results on an as-found basis
to a given continuous query involves two main challenges.
First, the potentially relevant results within a time period (e.g. a day)
are not known in advance. Without knowing all the relevant
results, how can we find the top-k among them to
return to the user?
Second, the points in time where the top-k relevant
results will appear are also not known in advance.
Given that the freshness of results is very important,
how can we ensure that we return the top-k results
as soon as they appear?
Regarding the first challenge, clearly we will have to
wait until we see all results in order to calculate
the exact top-k. However, in a practical scenario,
we may safely assume that the user is willing to exchange
some imprecision in the top-k results for greater
freshness.
For example, in our digital camera example,
the user may be happy to receive the 10 deals out of which
only 9 belong to the top-10,
but receive them soon enough to actually be able to buy the products.
Given this relevance/freshness tradeoff, in this paper we present
an optimization method for bounded continuous search queries on the Web.
More specifically, our goal is to extract the top-k relevant
pages that appear over a specified time interval on the Web and
return them to the user as soon as possible.
Our proposed solution utilizes principles
from the field of optimal stopping [34]
in order to deliver fresh, high-quality and bounded search results during the
specified execution time of a continuous query.
Optimal stopping is a well-known problem in the field
of financial mathematics.
The main idea of this paper is to consider the development of the relevance
of versions of Web pages as relevance charts and to treat the problem
of estimating top-k results as-found similar to a basic problem in financial
mathematics, the problem of finding the buying or selling points for stocks at optimal prices.
In summary, our contributions in this paper are:
- We define bounded continuous search queries as standing
search queries that extract the estimated top-k documents from a specific Web
area over a period of time.
Bounded search queries have the advantage that the amount of information
returned to a user is controlled without further user interaction
in contrast to many previous approaches in the field of document filtering or
topic tracking.
- Considering bounded continuous queries we demonstrate that there is a
tradeoff between freshness and the quality of query results.
- We present and evaluate a new method to optimize the retrieval quality
for the cases where up-to-date information is required by the user. The
new approach is based on
the optimal stopping theory and estimates the relative ranking values
of future documents based on previous observations in a stream of documents.
In the next section we start our discussion by presenting
the strategy currently employed by search engines.
We then demonstrate that, in cases
where information must be as up-to-date as possible, current approaches
may return sub-optimal results to the users.
Next, we define a new language
for bounded continuous search queries and present our optimization approach, which utilizes
principles from the field of optimal stopping.
Finally, in the experimental section we verify that our new
method can generate fresher and more relevant results for a variety of continuous queries.
We conclude with an overview of related research and a summary.
1 Periodic Evaluation Method
In this section we give basic definitions and present a common
strategy to process bounded continuous search queries that is applied by Web search
systems and used as a reference strategy in this paper.
In this work we consider a simple stream of documents, or versions of a specific document,
d_1, d_2, ....
The index corresponds to the time a document is obtained from the Web in a push or a
pull manner [21].
Definition 1:
We consider bounding conditions that are specified by the maximal number
of documents to be returned. A bounding condition provided by a user corresponds to the
maximal information load a user is willing to accept with respect to a query.
It is obvious that threshold-based information filtering methods presented in the field of
topic tracking and detection [2] are not bounded.
We consider query profiles that are determined by a set of query terms provided by a user.
We may thereby assume that query profiles, similar to documents, may be
expressed in a term vector space. Well-known methods from Information Retrieval may
therefore be applied to compute a distance between a query profile and a document.
Based on the tf-idf measure [32] we may apply the function

    r(d, q) = \sum_{t \in q} tf(t, d) \cdot idf(t)    (1)

to compute the distance between a document and a query profile (i.e. a set of query terms)
[39].
In the formula, tf(t, d) (term frequency)
is the frequency of a term t in a document d,
N is the number of documents in the collection,
and idf(t) = log(N / n_t) is the inverse document frequency,
where n_t is the number of documents in the collection that contain t.
Formula (1) is one way to estimate the relevance of a document with respect
to a query. In this paper, in order to keep our discussion concise, whenever we
talk about relevance we will use equation (1), but in general
any other estimator could be used.
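Assuming equation (1) is the usual tf-idf sum over query terms, a minimal sketch of such a relevance score looks as follows (all function names and the toy collection are ours):

```python
import math

def relevance(doc, query_terms, collection):
    """tf-idf relevance score in the spirit of equation (1);
    `collection` is a list of documents (term lists) used to
    derive the document frequencies n_t."""
    N = len(collection)
    score = 0.0
    for t in set(query_terms):
        tf = doc.count(t)                             # term frequency tf(t, d)
        n_t = sum(1 for d in collection if t in d)    # document frequency
        if tf and n_t:
            score += tf * math.log(N / n_t)           # idf(t) = log(N / n_t)
    return score

docs = [["camera", "flash", "camera"], ["camera"], ["news"]]
s = relevance(docs[0], ["flash"], docs)   # "flash" occurs in 1 of 3 docs
```

Note that a term occurring in every document gets idf = log(1) = 0 and contributes nothing, which matches the intent of the measure.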
A problem of this measure is the computation of the idf term: at a specific point
in time the entire set of documents appearing in the stream is not yet available.
Possible approaches to this problem have been presented in [38],
[18] and [31].
In this work we consider each version of the information source or incoming document
as a single document 'collection'. The idf-term is based only
on the state of the information source at the current point in time.
If s_t is the state of the information source at time t,
we denote the respective ranking of the corresponding document d_t
with respect to the query q as r(d_t, q).
As described above, in this work we make the simplifying assumption that at each point in time a single
(new) document is available.
We may therefore define the sequence of ranking values
r_t := r(d_t, q) for t = 1, 2, ....
The function r is used to obtain relevance estimations for new documents
in the sequence with respect to a query profile.
In this work we simplify the search for optimal documents by defining
quality as the estimation provided by the ranking function r;
quality is used as a synonym for the relevance of documents.
In order to find optimal documents we therefore simply have to find the documents
with the highest ranking values according to r.
This quality definition is used because the development of estimation functions similar to
(1) is not the focus of this paper but has been examined thoroughly in the
field of information retrieval.
In this paper we show that even if an optimal estimator for the quality of documents
is given (or assumed) the optimization of bounded continuous search queries is
not a trivial problem.
Based on the previous definitions we may now define a common strategy to
process bounded document filtering, the periodic evaluation (PE) method.
In this work we consider a PE method that applies function (1) as the ranking or
evaluation function.
Obviously a PE method is a bounded filtering method according to
definition 1
due to the bounding condition, which may e.g. imply a
maximal number of pages returned in each evaluation period or a total maximal
number. In the latter case the number of returned pages per evaluation period is
determined by (the closest integer to) the total number of documents to be returned according
to the bounding condition divided by n, the number of evaluation periods.
A PE-query may e.g. inquire about the 'best 10' pages each day with respect to a set of query terms, as
realized e.g. by the GoogleAlert system [16].
The PE-method is illustrated in the figure.
The plotted symbols denote ranking values depending
on the current state of a specific data source or a new document in a stream,
a query profile and a ranking function (e.g. function (1)).
In this example there are two evaluation periods and the bounding condition is 4,
i.e. the best two
documents in each evaluation period have to be selected and forwarded to a user,
as indicated by the circled ranking values.
2 The freshness/quality tradeoff
In this section we demonstrate cases where the PE strategy is sub-optimal and thereby
illustrate the tradeoff-problem between freshness of information
and the quality of retrieved results.
It is obvious from the figure that by applying the PE strategy,
documents are returned with a certain delay between the point in time
when a document is obtained
and the time at the end of an evaluation period when the document, or a respective notification,
is forwarded to a user.
We may therefore define a freshness (or reciprocal: delay) metric as

    delay = (1/k) \sum_{n=1}^{E} \sum_{j=1}^{k_n} (T_n - t(d_{n,j}))

where T_1, ..., T_E are the end points of
the evaluation periods
(see the figure); at these points in time results are sent to a user.
E is the number of evaluation periods and k is the number of requested
(estimated best) objects.
k_n is the number of optimal elements to be selected
in the n-th evaluation period
(with \sum_{n=1}^{E} k_n = k),
d_{n,j} is the j-th best element selected in
the n-th evaluation period, and t(d_{n,j}) is its time of appearance.
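Under our reading of this metric (the average wait between a selected document's appearance and its delivery at the end of its evaluation period), it can be computed as follows; the function name and data layout are ours:

```python
def mean_delay(periods, k):
    """periods: list of (T_end, [appearance times of the elements
    selected in that evaluation period]); k: total number of
    requested objects. Returns the average wait between a
    document's appearance and its delivery at the period's end."""
    return sum(T - t for T, times in periods for t in times) / k

# two evaluation periods ending at T=5 and T=10; the selected
# documents appeared at times 1, 3 and 6, 9 respectively
d = mean_delay([(5, [1, 3]), (10, [6, 9])], k=4)
```

With a single evaluation period covering the whole query time, this value is maximal; as-found delivery drives it to zero.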
It may now be shown that a PE-method is not optimal if a high freshness is required.
Theorem 1
The PE-method may choose sub-optimal documents if a high freshness value is required.
The validity of this theorem may be demonstrated by considering
the example in figure 1.
If the best documents that appear during the query execution time have to be selected,
the optimal strategy is to store documents and to wait until the end of the query period.
At this point the 4 highest-ranked documents may be returned, if
we assume that the bounding condition implies a number of 4 documents to be returned.
However, as shown in the example in figure 1, the delay value as defined above
may then be significant.
In order to acquire fresher results, a larger number of evaluation periods has to be considered.
In figure 1 the query period is subdivided
into 2 evaluation periods.
The bounding condition in this case is 2 documents for each
of the two evaluation periods in order to fulfill the global bounding condition
of maximal 4 documents.
The freshness of retrieved optimal documents is obviously increased.
However, the selected documents (illustrated by the circled symbols)
are no longer the optimal ones and represent a suboptimal choice.
Figure:
The PE(n) strategy: A number of best items according
to the bounding condition is returned to a user after each of the n evaluation periods.
The reason for this decrease in retrieval quality is the missing knowledge about
future document rankings when objects are evaluated at an earlier point in time.
This is obviously an intrinsic problem when the optimization of bounded continuous search
queries is concerned: no method has information about future data objects,
and therefore every conceivable method is subject to this problem, which we denote
as the freshness/quality tradeoff.
It has to be noted that this tradeoff problem does not apply to threshold-based filtering methods.
In the example in the figure we would not have the restriction on the maximal number
of objects to be returned and could forward each object above a specified threshold.
However, in this case, not knowing future ranking values, a suboptimal
threshold may be chosen, which affects precision and recall results.
3 A query method for bounded continuous search (BCS) queries
1 The query language 'BCSQL'
In this section we describe the main syntax of a new query language to state
bounded continuous search (BCS) queries and in subsequent sections we describe how these
queries are answered within our prototype system.
At a high level, we employ a query model similar to the OpenCQ language [23].
In OpenCQ
a continuous query is a triple (Q, T, S)
consisting
of a normal query Q (e.g. written in SQL), a trigger condition T
and a termination criterion S.
In this work we consider only time-based trigger conditions.
We extend the basic notation of OpenCQ in order to support continuous
search queries.
For this purpose we assume the availability of a ranking function for query results as provided by (1).
A main extension with respect to many continuous query languages is the possibility
to provide a bounding condition.
In the considered query language a user has to define the number of estimated best results
to be returned.
This feature is well-known from common search engines.
The best 'n' results are displayed on the first result page.
A further specific attribute is the requirement to specify a user profile consisting of
query terms.
An example for the considered query language is the following:
- CREATE BCSQ:
- SalesWatch as
- Query:
- SELECT ESTIMATED BEST 10
FROM SERVER www.ebay.com
WHERE query='camera 12 mega flash'
- Trigger:
- 60 minutes
- Start:
- now
- Stop:
- 7 days
- Delay:
- 0 minutes
In this query the user requests the estimated best 10 documents on the server www.ebay.com
over a period of 7 days with respect to the query terms 'camera 12 mega flash'.
The trigger condition in this query language is used to define the incoming document stream
in a pull-based manner. In this example
the data source is reloaded every hour.
Since the user in this example wants to buy the respective camera, she is interested
in an immediate notification if relevant pages appear.
The delay parameter (Delay=0) indicates that results should be
delivered immediately after detection on the
Web.
By the 'BEST 10' directive she may limit the number of irrelevant
pages returned by the query engine.
The 'ESTIMATED BEST' directive in the query denotes that, given an appropriate
ranking measure and estimation method, the query engine should
estimate and return the best documents.
In general it is not known whether versions of a data source that appear in the future
will have a higher ranking and thereby declassify the current version as (relatively)
irrelevant.
The current version would thereby create 'costs' in terms of
information overload and a decreased 'precision', if returned to a user.
In the example the current version may contain the terms 'camera 12 mega' but a future
version may contain the terms 'camera 12 mega' and 'flash' which declassifies the current version.
However if the query engine waits until all versions have been available, the respective cameras
may already be sold.
In the following we refer to this query language as
bounded continuous search query language (BCSQL).
2 Answering queries: selecting the best k
In this section we give an introduction to the considered optimal stopping problem,
frequently denoted as the 'Secretary Selection Problem' (SSP).
We first summarize results from the literature that form the basis for the optimization
method in this paper.
In the classical SSP a sequence of ranked objects is presented to a 'player',
whose task is to choose the best object.
The choice is based only on previous observations.
The ranking values of the objects are assumed to be distinct and equally distributed.
An object has to be chosen immediately when presented to the player
and may not be chosen later.
This basic problem has been analyzed e.g. in [14]
and
[12].
A well-known strategy for this problem is to observe a number of
candidates without choosing them.
The respective ranking values of these candidates are stored.
After this observation period, the first subsequent candidate is chosen that has a higher
ranking than the maximal ranking value of the candidates in the observation period.
The main problem then is to find an optimal
length of the observation period.
An optimal strategy for this problem in order to maximize the
probability of finding the best candidate is to choose an observation
period of N/e, where N is the number of candidates and
e is Euler's number.
In other words,
approximately a third of the candidates (more precisely, a fraction 1/e ~ 0.37)
should be observed without being chosen.
This result has been proved 'in the limit', for N -> infinity.
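The classical 1/e rule can be checked with a small simulation (a sketch under the stated assumptions of distinct, randomly ordered ranking values; all names are ours). With N = 100 candidates the empirical success rate comes out near 1/e:

```python
import math
import random

def secretary(values, observe):
    """Classical stopping rule: skip the first `observe` candidates,
    then take the first later one that beats everything seen so far."""
    best_seen = max(values[:observe]) if observe else float("-inf")
    for v in values[observe:]:
        if v > best_seen:
            return v
    return values[-1]   # forced to take the last candidate

random.seed(0)
N, trials, wins = 100, 2000, 0
for _ in range(trials):
    vals = random.sample(range(10 * N), N)        # distinct ranking values
    if secretary(vals, int(N / math.e)) == max(vals):
        wins += 1
rate = wins / trials                               # close to 1/e
```

Shorter or longer observation periods lower this success probability, which is the sense in which N/e is optimal.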
Further strategies for the basic SSP are discussed in [19].
Extensions of the basic SSP have been proposed
in [14] and
[30].
In contrast to the problem of selecting one single best candidate, in this paper we
consider the more general problem of selecting the best k candidates in a stream
of ranked documents by k choices, which we denote as the k-SSP.
An obvious extension of the single SSP is to consider not a single observation
period (needed to adjust the optimal selection probability) but k
observation periods.
Our method, following an approach in [15],
first implies the choice of k
starting times t_1 <= t_2 <= ... <= t_k.
After rejecting the first
candidates, the first
candidate considered for selection is examined at or after time t_1.
(1) If candidates have already been examined, with s objects accepted and the others rejected,
a new object is chosen if it is at least better than one of the objects
already selected. It is rejected if it is worse than at least one of the objects
already rejected.
(2) If, among all the candidates examined so far, the new object is ranked
between the accepted and the rejected objects,
it is chosen if the current point in time t satisfies t >= t_{s+1}
and rejected if
t < t_{s+1}.
(3) If s choices have been made, where s < k, and only k - s candidates are left
with respect to the entire sequence of input candidates to be evaluated,
all of the remaining candidates must be chosen in order to
guarantee that k objects are chosen.
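Our reading of these three rules can be sketched as follows (an illustration, not the authors' implementation; the function name, data layout and toy stream are ours):

```python
def k_ssp_select(stream, start_times, k):
    """stream: list of (time, rank) pairs arriving in time order;
    start_times: the k precomputed decision times t_1 <= ... <= t_k.
    Returns the accepted (time, rank) pairs."""
    accepted, rejected = [], []
    n = len(stream)
    for i, (t, rank) in enumerate(stream):
        if len(accepted) == k:
            break
        if n - i <= k - len(accepted):                        # rule (3): forced
            accepted.append((t, rank))
        elif accepted and rank > min(r for _, r in accepted):  # rule (1): beats a chosen one
            accepted.append((t, rank))
        elif rejected and rank < max(r for _, r in rejected):  # rule (1): worse than a rejected one
            rejected.append((t, rank))
        elif t >= start_times[len(accepted)]:                  # rule (2): in between, past t_{s+1}
            accepted.append((t, rank))
        else:                                                  # rule (2): too early
            rejected.append((t, rank))
    return accepted

# toy stream of 7 candidates, two choices, thresholds at t=2.5 and t=4.5
stream = [(1, 5), (2, 3), (3, 8), (4, 6), (5, 9), (6, 1), (7, 2)]
picked = k_ssp_select(stream, [2.5, 4.5], k=2)
```

In the toy run, the first two candidates fall into the observation phase, the candidate with rank 8 is taken once the first threshold has passed, rank 6 arrives before the second threshold and is rejected, and rank 9 is taken because it beats an accepted object.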
In this paper we do not provide a proof of the previous strategy, but in the experiments
the algorithm is evaluated with artificial and real relevance sequences.
Figure:
A strategy for selecting the best two candidates in a stream
of documents. Rejected candidates are marked by rectangles, accepted candidates by circles.
An example is shown in the figure. The candidates appear sequentially in time,
two candidates are to be returned, and there are two starting points.
The first candidates are rejected due to the first observation phase.
The next candidate is accepted because it is better than all of the previously
rejected candidates. A later candidate is better than all the previously rejected candidates and worse
than all the previously accepted candidates; it is rejected because it appears before the
second stopping time, and would have been accepted if it had appeared after that time.
A further candidate is accepted because it is better than at least one previously accepted candidate.
Due to the previous choice of two candidates, the remaining candidates are not
considered.
Based on this selection strategy, the main problem is to find optimal times
t_1, ..., t_k in order to maximize the probability of
choosing the best k candidates.
Due to the equal distribution of ranking values, intuitively the starting times should
be spread evenly over the considered time period.
In [15] a strategy to position the starting times is proposed and
proved to be optimal; we apply it in our experiments.
In the SSP as in the BCSQL optimization problem the candidates or versions of the
data source appear sequentially ordered one after another.
There exists a definite starting point and a definite endpoint in the BCSQL problem.
In the SSP the starting point is determined by the time of the appearance of the
first, the endpoint by the appearance of the last candidate.
The trigger executions in the BCSQL correspond to the considered candidates in the SSP.
Each candidate is assigned a ranking value in the SSP.
In the SSP the ranking values are assumed to be distinct. In the BCSQL problem
this property depends on the applied ranking function and may not be fulfilled
(especially if the data source did not change between 2 trigger executions).
The condition of different ranking values may be guaranteed
artificially by considering ranking values that depend on time, i.e. versions appearing later
in the sequence are assigned a lower ranking value. In the SSP as in the CQ problem the selection strategy may be based only on previous
observations. No information about future objects is available.
In contrast to the general BCS query language, the delay
parameter is not adjustable if the SSP is applied to the optimization
of retrieval results: results are returned immediately (delay=0)
if they are estimated to have a high ranking.
4 A query engine for BCS processing
Figure shows the basic steps of the
BCS query processing algorithm.
The input of the algorithm are the start and the end time of the continuous query,
the trigger condition, a value '' for the number
of estimated best items to be chosen and a query profile .
Based on the start, the end time and the trigger condition in steps 1 and 2
the number of reload operations (i.e. the number of 'candidates') and the times
of reload operations are computed.
Applying the k-SSP strategy in section the starting times
are computed based on the number of
candidates and the number 'k' of highly ranked candidates to be chosen.
At time the first candidate is loaded in step 7 and the ranking with
respect to the search query '' is computed (section 2.1).
The ranking is compared to previous ranking values in step 9 which are available
in the list and the relative ranking is computed.
In step 10 it is determined if a new version is chosen as a highly ranked
candidate according to section .
In figure we assume the availability
of a function isSelected(C) that indicates, if a candidate C
has been selected.
In step 11 the new candidate C is inserted into the list rankList
at the position determined by the ranking value.
If the candidate is chosen, a message is sent to the client.
Finally the algorithm waits until the time of the subsequent reload time
in step 13 and returns to step 6.
BCS-Query-Processing (Input: start-time s, end-time e, trigger-condition tc, 'number of best choices' k, Query Q)
- 1 rankList := null
- 2 compute number of candidates N based on s, e, tc
- 3 compute reload times based on s, e, tc
- 4 compute starting times based on N, k
- 5 wait until the first reload time
- 6 for (i = 1, ..., N)
- 7 load candidate C
- 8 compute ranking of C based on Q
- 9 compare ranking to previous rankings in rankList
- 10 select or reject C according to the selection strategy
- 11 insert C into rankList
- 12 if (isSelected(C)) send message to client
- 13 wait until the next reload time
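The loop above can be sketched as follows (a simplified illustration, not the authors' implementation: step 10 is reduced to "take a candidate that beats every ranking seen so far, once its threshold time has passed", plus the forced choices of rule (3); all names are ours):

```python
import bisect

def reload_times(start, end, trigger):
    """Steps 2-3: the pull schedule implied by the trigger condition."""
    times, t = [], start
    while t <= end:
        times.append(t)
        t += trigger
    return times

def bcs_query(start, end, trigger, k, rank_fn, load_fn, notify_fn, start_times):
    """Skeleton of the BCS-Query-Processing loop; load_fn(t) returns
    the candidate at reload time t, rank_fn(doc) its relevance, and
    notify_fn(t, doc, r) is called when a candidate is selected."""
    rank_list = []                                    # step 1
    schedule = reload_times(start, end, trigger)      # steps 2-3
    n, accepted = len(schedule), 0
    for i, t in enumerate(schedule):                  # step 6
        doc = load_fn(t)                              # step 7: reload candidate
        r = rank_fn(doc)                              # step 8: rank against Q
        better = bool(rank_list) and r > rank_list[-1]    # step 9
        forced = n - i <= k - accepted                # rule (3)
        chosen = accepted < k and (
            forced or (t >= start_times[min(accepted, k - 1)] and better))
        bisect.insort(rank_list, r)                   # step 11
        if chosen:                                    # steps 10, 12
            accepted += 1
            notify_fn(t, doc, r)

# toy run: ranks of the 7 candidates by reload time
ranks = {0: 5, 1: 3, 2: 8, 3: 6, 4: 9, 5: 1, 6: 2}
picks = []
bcs_query(0, 6, 1, 2, lambda d: ranks[d], lambda t: t,
          lambda t, d, r: picks.append((t, r)), [1.5, 4.0])
```

In a real system load_fn would perform the HTTP reload and notify_fn would e-mail the client; here both are stubbed so the control flow can be followed.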
4 Experiments
In the following experiments we compare the new BCS query method to the
periodic evaluation (PE) method.
The considered quality parameters are the freshness of the retrieved information according
to the delay metric above and the quality of the search results according to our quality definition.
Applying the k-SSP method, objects that are estimated to be
relevant are returned to a user immediately after detection on the Web.
In this case we assume an immediate decision
of the filtering method, and the delay value is 0.
Above we defined the quality, or relevance, of a single document
retrieved by a search engine. In order to measure the quality of a set of retrieved
documents we build the sum of the quality values of the individual documents.
In [20] a very similar relevance measure is presented that is based
on graded relevance assessments (in contrast to the binary relevance assessments
usually considered in IR).
In [20] the functions

    gr = \sum_{d \in retr} relevance(d) / \sum_{d \in D} relevance(d)
    gp = \sum_{d \in retr} relevance(d) / |retr|

for the graded recall (gr) and
the graded precision (gp)
are proposed, where
D denotes the entire set of documents, relevance is a
function providing relevance values for
documents, and retr is the set of retrieved documents.
In the experiments we apply the same measures
and define the relevance function according to our quality definition.
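Under this reading of the two measures (graded recall as the retrieved share of the total relevance mass, graded precision as the average relevance of the retrieved documents), a minimal sketch looks as follows (names and toy values are ours):

```python
def graded_recall(rel, retrieved, all_docs):
    """Sum of relevance over the retrieved docs, divided by the
    total relevance mass in the whole collection D."""
    return sum(rel[d] for d in retrieved) / sum(rel[d] for d in all_docs)

def graded_precision(rel, retrieved):
    """Average relevance of the retrieved documents."""
    return sum(rel[d] for d in retrieved) / len(retrieved)

rel = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.2}   # graded assessments
gr = graded_recall(rel, ["a", "c"], list(rel))    # (0.9 + 0.3) / 2.0
gp = graded_precision(rel, ["a", "c"])            # (0.9 + 0.3) / 2
```

With binary (0/1) relevance values these reduce to the classical recall and precision, which is why they are natural generalizations for graded assessments.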
In the experiments we work with simulated and real data. The k-SSP method assumes a special
distribution of ranking values, in particular an equal likelihood of each new ranking value;
real data are not always distributed in this way.
Figure:
Relevance evolutions of the source 'www.washingtonpost.com'
over a period of 60 days with respect to the queries 'oscar' and 'basketball'.
As an example, the figure shows the relevance evolution of the source
with respect to the two queries ('oscar' and 'basketball') over a period of 60 days, according
to the quality definition in (1).
Generated data are considered in the following in order to demonstrate the basic
functionality of the method in principle.
Real Web data are considered to show that the presented filtering method based on
k-SSP selection may also be applied to the relevance distributions
of real information.
In this paragraph we present experiments with simulated data
in order to analyze statistical properties of the presented BCS method
compared to the PE method.
The main advantage of simulated data is a simple and exactly
known distribution of the input, which helps to illustrate the
main properties of the new method.
In these experiments, sequences of distinct ranking values
with identical likelihoods are generated; an individual sequence is gradually provided as input to
the BCS and PE algorithms.
Figure:
Dependence of the mean values of retrieval results (graded recall and precision) on the number of chosen best objects 'k'.
In the experiment illustrated in the figure,
50 sequences of (distinct) ranking values
of candidate size 100 were generated.
The figure shows the respective mean values of the
acquired retrieval qualities (graded recall and precision)
for different values of k, the number of objects to be returned.
The PE strategy is applied to the data based on
a single evaluation period (PE(1)) of length 99, where the time span between the
appearances of two candidates is assumed to be 1.
These retrieval results are optimal
since the PE(1) method has knowledge of the
whole distribution of (previous) retrieval values.
The next lower graph shows the retrieval quality of the BCS method,
and the lowest graph the retrieval quality of a random method
that chooses a number of arbitrary candidates.
The figure shows that the BCS method provides significantly
better results than the random strategy. As expected, the quality is lower
than that provided by the PE(1) method, which has access to the entire set
of ranking values.
In the experiment shown in the figure,
the retrieval quality of the BCS strategy is lower than the retrieval
quality of the PE(1) strategy. However, the BCS results are returned
to a user immediately, while the PE(1) strategy returns results at the end
of a single evaluation period, which in this case is identical to
the query period.
If fresher results are requested when using the PE method, obviously
a larger number of evaluation periods has to be considered during
the query execution time. We then proportion the k
requested items to the number of evaluation periods;
if k is not divisible by the number of periods,
the selected candidates are apportioned
with equal likelihood to the evaluation periods.
In the following we consider the tradeoff between retrieval quality and freshness
of data.
Figure:
Comparison of PE and BCS strategy: the intersection (IS) of BCS and PE recall defines the maximal number of PE intervals (evaluation periods) where the retrieval quality of PE(n) is better than the retrieval quality of the
BCS strategy.
In this experiment we consider the mean values of
200 generated sequences.
The left part of the figure shows the development of the graded recall
for the PE(n) strategy when the number of PE intervals is increased from
1 to N, the number of versions over time.
Due to a constant k
it is sufficient to consider the graded recall; similar results would have been
obtained by considering the graded precision.
If the number of evaluation intervals is N, the retrieval quality
of the PE strategy (PE(N)) is obviously
similar to a strategy where candidates are selected randomly: since the number of candidates is N,
the average graded recall of the random strategy is k/N.
In the left figure a further graph shows the
retrieval quality of the BCS strategy,
and another the retrieval quality of the PE strategy when only a single evaluation period
is considered (PE(1)).
Both strategies do not depend on the number of observation
periods (x-axis).
The intersection point (IS) of the fitting lines of the BCS and PE(n) strategies
defines, on the x-axis, the maximal number of evaluation periods for which
the graded recall of the PE strategy is better than the graded recall of
the BCS strategy. I.e., if the number of intervals is increased further because
fresher results shall be returned by the PE strategy, the retrieval quality
becomes lower than the retrieval quality of the BCS method.
Beyond the intersection point IS, the PE
strategy therefore becomes inferior to the BCS strategy. The BCS
strategy provides maximal freshness due to an immediate delivery of results,
and in this situation its retrieval quality is also superior to
the PE method in a probabilistic sense.
In order to quantify this situation, in the right part of the figure
we consider the delay of the considered strategies according to the definition above.
One set of data points denotes the delay of the results
of the PE strategy considering a single evaluation period (PE(1)). The figure shows
that results are delivered with a delay of approximately 0.5 (50%) of the
entire query execution time: if, e.g., the query execution time is 40 days,
results are returned with a delay of 20 days.
A further curve shows the delay of the PE(n) method where the number
of evaluation periods is increased from 1 to N;
obviously the delay converges monotonically to 0.
The main point in this graph is the y-value of the PE-delay curve
whose x-value corresponds to the x-value
of the intersection point IS in the figure on the left.
This point marks the minimal delay of the PE strategy where
the retrieval quality is better than the retrieval quality of
the BCS strategy. In other words: If results are requested by
a user that are fresher than , a user should prefer
the BCS strategy presuming he wants to acquire maximal retrieval
quality. Otherwise, if less fresh results are sufficient, the PE strategy should be applied.
We denote as -turning point in the following.
The turning point is obtained by a local linear fit
of the PE and BCS recall close to the intersection point IS
in the figure (left)
and by computing the intersection point of
the respective PE and BCS fitting lines.
Then the corresponding point in the PE-delay graph (the intersection of
this x-value with the PE-delay curve) has to be extracted.
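This fit-and-intersect procedure can be sketched as follows. The data points are hypothetical recall measurements near the crossing, and the helper names are illustrative, not the paper's implementation:

```python
def fit_line(xs, ys):
    """Least-squares fit y = m*x + b; returns (m, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx

def line_intersection(l1, l2):
    """x-coordinate where two lines y = m*x + b meet."""
    (m1, b1), (m2, b2) = l1, l2
    return (b2 - b1) / (m1 - m2)

# Hypothetical local measurements near the crossing: BCS recall is flat
# in the number of evaluation periods, PE(n) recall falls as n grows.
ns         = [6, 8, 10, 12]
bcs_recall = [0.39, 0.39, 0.39, 0.39]
pe_recall  = [0.45, 0.41, 0.37, 0.33]

n_star = line_intersection(fit_line(ns, bcs_recall), fit_line(ns, pe_recall))
# n_star is the maximal number of evaluation periods for which PE(n)
# still beats BCS; the turning-point delay is then read off the
# PE-delay curve at x = n_star.
```

With these sample points the fitted lines cross at n_star = 9, i.e. PE(n) would remain preferable up to roughly nine evaluation periods.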
In the example in the figure (right), BCS should be
used if results are requested that are fresher than
3.3% of
the query execution time.
If the entire query period is 40 days, the BCS strategy should be used
if results have to be fresher than 1.3 days.
A further figure shows the turning point
for different values of the number of candidates and of the number of
best values to be returned.
It may be observed that the turning point tends to decrease for higher
values of the latter. The image on the right suggests that, provided the number of
candidates is sufficiently high, the turning point is relatively constant.
2 Real information sources
In the following experiments we applied the BCS strategy to
data sources on the Web. In particular, we considered the homepages of various
newspapers in English or German.
Before the experiments we first created a Web archive over a period of a quarter of a year:
using a Web crawler,
a mirror of the sources was obtained and stored at regular points in time
(twice a day).
Based on this Web archive we extracted a number
of query terms, namely the most frequent terms in the archive
not contained in the list of stop words.
Thereby 480 German and 480 English query terms were extracted.
In the following experiments we considered BCS queries of the following
structure:

    Query: SELECT ESTIMATED BEST d
           FROM PAGE url
           WHERE query='singleterm'
           Trigger=9h and 17h, Start=now, Stop=80days
           Delay=delay
We applied the queries to the Web archive; the trigger condition corresponds
to the versions available in the Web archive.
The delay parameter is 0 for the k-SSP estimation method and 80 days for the PE(1) method
(single evaluation period).
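Replaying such a query against the archive can be sketched as follows. This is a simplified illustration: the term-frequency ranking and the offline best-k selection (the PE(1) upper bound) are placeholder implementations, not the paper's system.

```python
def score(document, term):
    """Toy relevance value: relative frequency of the query term in the page text."""
    words = document.lower().split()
    return words.count(term.lower()) / max(len(words), 1)

def replay_best_k(snapshots, term, k):
    """Offline upper bound (PE(1)): rank all archived versions of a page
    and keep the k best -- possible only after the whole query period."""
    ranked = sorted(snapshots, key=lambda s: score(s["text"], term), reverse=True)
    return ranked[:k]

# snapshots would be the twice-daily mirrors from the Web archive, e.g.:
snapshots = [
    {"time": "day1-9h",  "text": "election results announced today"},
    {"time": "day1-17h", "text": "weather report and sports"},
    {"time": "day2-9h",  "text": "election runoff election debate"},
]
best = replay_best_k(snapshots, "election", k=2)
```

An online strategy such as BCS must instead accept or reject each snapshot at its trigger time, without seeing later versions; the offline ranking above only serves as the quality reference.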
Table: Evaluation results (k=4)

| source (language) | BCS gr | PE(n) gr, max. delay 2 days | 4 days | 8 days | 12 days | PE(1) optimal gr | turning point (in days) |
|---|---|---|---|---|---|---|---|
| news.bbc.co.uk (e) | 0.387 | 0.135 | 0.246 | 0.408 | 0.506 | 0.595 | 7.9 |
| www.faz.net-s-homepage.html (g) | 0.374 | 0.107 | 0.245 | 0.354 | 0.443 | 0.548 | 8.8 |
| www.ftd.de (g) | 0.317 | 0.095 | 0.23 | 0.319 | 0.408 | 0.5 | 7.9 |
| www.latimes.com (e) | 0.24 | 0.067 | 0.104 | 0.238 | 0.3 | 0.404 | 8.4 |
| www.nytimes.com (e) | 0.231 | 0.094 | 0.159 | 0.248 | 0.321 | 0.424 | 7.4 |
| www.spiegel.de (g) | 0.326 | 0.104 | 0.167 | 0.328 | 0.404 | 0.5 | 7.5 |
| www.sueddeutsche.de (g) | 0.307 | 0.099 | 0.157 | 0.267 | 0.351 | 0.463 | 10.7 |
| www.usatoday.com-news-front.htm (e) | 0.307 | 0.093 | 0.153 | 0.277 | 0.372 | 0.542 | 11.2 |
| www.washingtonpost.com (e) | 0.121 | 0.074 | 0.113 | 0.167 | 0.223 | 0.307 | 5.1 |
| www.welt.de-chl-20.html (g) | 0.178 | 0.057 | 0.076 | 0.155 | 0.198 | 0.296 | 10.6 |
| www.wienerzeitung.at (g) | 0.332 | 0.118 | 0.154 | 0.322 | 0.439 | 0.605 | 9.6 |
| www.zeit.de (g) | 0.276 | 0.074 | 0.134 | 0.283 | 0.361 | 0.479 | 7.5 |
| normalized mean value | 57% | 18% | 31% | 58% | 73% | 100% | |
The table shows a representative fraction of these experiments:
the retrieval quality (graded recall)
of the BCS, the PE(n) and the PE(1) strategies for 12 Web pages,
7 in German and 5 in English.
We consider the 4 best objects to be
chosen (k=4). Each entry in the table is the mean value of the
respective quality parameters over all considered (480 German and 480 English) queries.
In these experiments we specified a maximal delay of returned results of
2, 4, 8 and 12 days and adjusted the number of evaluation intervals in the
PE(n) method accordingly. The table shows the resulting graded recall (gr) values
of the different methods and the turning point (in days) for each source.
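The graded-recall figures in the table can be read as follows. This is a sketch of one common definition in the spirit of graded relevance assessments [20]; the paper's exact definition is not reproduced here, so the normalization is an assumption.

```python
def graded_recall(retrieved_grades, all_grades, k):
    """Sum of relevance grades of the returned documents, normalized by
    the grade sum of the ideal choice of the k best documents."""
    ideal = sum(sorted(all_grades, reverse=True)[:k])
    if ideal == 0:
        return 0.0
    return sum(retrieved_grades) / ideal

# Hypothetical grades on a 0-3 scale for all candidate versions:
all_grades = [3, 2, 2, 1, 0, 0]
# An online strategy returned versions with grades 3, 2, 1 and 0 (k = 4):
gr = graded_recall([3, 2, 1, 0], all_grades, k=4)   # 6/8 = 0.75
```

Under this reading, a table entry of 0.387 means the strategy captured, on average, 38.7% of the relevance mass of the ideal choice of the 4 best versions.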
As expected, the retrieval quality of the new BCS method
is lower than that of the PE(1) method, which achieves the maximal retrieval
quality with respect to the number of retrieved pages.
If a higher freshness of results is requested,
the retrieval quality of the PE(n) method
decreases, as the experiments show.
The turning-point value (in days) in the table marks the maximal
freshness for which a user obtains the best results by the PE(n) method;
below this point the BCS strategy provides results of a higher retrieval quality.
The last row of the table shows the mean values over all data sources, normalized
by the maximal recall value provided by the PE(1) strategy.
A further figure shows the turning point
for exemplary data sources and the mean value of 14 data sources
with respect to the number of best items to be returned.
The mean turning point (e.g. for k=8) is approximately 7.3
days, i.e. 9.1% of the query execution time of 80 days.
Query systems are available to automate and simplify similar
search problems; they are known as continuous query, monitoring, notification, alert or
information dissemination services.
Many publishers provide, e.g., table-of-contents or
alert functions,
such as the ACM Table-of-Contents Alert [1],
Springer Link Alert [35] or
Elsevier Contents Direct [8].
Independent mediating alerting services like Hermes [11] or
DIAS [22] provide access to heterogeneous digital libraries.
Query languages for continuous queries are well known in the field of active databases
[6], [5], [33].
In this field the event-condition-action (ECA) model
is used to define standing queries to databases:
every time the event occurs, a trigger condition is tested, and the test result may
cause the execution of the defined action.
The respective information is assumed to be structured.
In many information dissemination systems, too [9], [13],
the considered query languages concern structured or semistructured data
[29], [17], [27],
[9], [10], [36], [28].
In [23] and [24], continuous query languages
for information on the Web
are presented that are more appropriate in a Web context and allow, e.g., the evaluation of
requests to Web forms.
Although the basic syntax of the query language considered in this paper is similar to many
of these previous languages,
we focus on unstructured data, in particular documents extracted from Web pages,
similar to the approaches in [16] and [26].
The corresponding task of extracting relevant documents from a stream of documents
is well known from the TREC filtering track
[31], [7], [4], [41],
[37], [40], [25]
and from the field of topic detection and tracking
[2], [3],
[38].
In the filtering track of the Text REtrieval Conference (TREC) [18],
streams of documents are considered;
the task is to optimize methods that realize an immediate distribution
of relevant documents.
A binary decision is made to either accept or reject a document as it arrives.
The information such classifiers are based on is usually a set of training
examples, i.e. documents provided with a relevance label and possibly a
topic description.
This information is used to create a query profile, which is applied to estimate the
relevance of future documents by a distance computation between the generated
query profile and a new document.
The decision to accept a document as relevant or not is finally based on
a threshold value, which may also be learned from the training examples.
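The threshold-based decision described above can be sketched as follows. This is a generic illustration assuming a bag-of-words profile and cosine similarity; the profile weights and the threshold are hypothetical, and this is not any particular TREC system.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def accept(profile, document, threshold):
    """Binary filtering decision made as a document arrives."""
    doc_vec = Counter(document.lower().split())
    return cosine(profile, doc_vec) >= threshold

# Query profile as it might be built from (hypothetical) training examples:
profile = Counter({"election": 3, "vote": 2, "poll": 1})
on_topic  = accept(profile, "the election vote was close", 0.3)   # accepted
off_topic = accept(profile, "sunny weather this weekend", 0.3)    # rejected
```

Every document exceeding the threshold is delivered, so the amount of returned information is unbounded; this is exactly the behavior the bounded query model of this paper avoids.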
In the 'adaptive' filtering track [18], the query profile or
the threshold is tuned by feedback provided by a user after the appearance of
a new document.10
Similar to the TREC filtering track, the field of topic detection and tracking (TDT)
[2],
[3],
[38] deals with the
problem of finding relevant documents in a document stream.
In this case the classifications are based on a significantly smaller training set, and tracking
(of events or topics) should start immediately, which is more appropriate for real
applications.
These previous filtering or tracking methods are usually threshold-based [37];
the returned information load is 'unbounded' according to
definition 1.
In contrast, in order to control the amount of information returned by a query engine
without the need for further user interaction, in this paper we consider
bounded continuous search queries.
Although bounded query strategies are well-known and applied in current
Web search systems [16], [26],
to our knowledge the quality/freshness tradeoff has not been thoroughly examined
for bounded continuous search queries.
Following approaches developed in the field of optimal stopping
[14],
[12], [19], [15],
[30], [34] we develop a new solution for the optimization
of bounded continuous search queries.
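For the simplest case in this literature (a single document to be chosen), the classical secretary rule observes the first n/e candidates without accepting any, then takes the first candidate better than everything seen so far; it selects the overall best with probability about 1/e. The following simulation illustrates that classical rule only, not the paper's k-SSP estimator:

```python
import math
import random

def secretary_success_rate(n, trials=4000, seed=7):
    """Fraction of trials in which the 1/e stopping rule picks the best
    of n candidates arriving in random order."""
    rng = random.Random(seed)
    cutoff = int(n / math.e)           # length of the pure observation phase
    hits = 0
    for _ in range(trials):
        ranks = list(range(n))         # rank 0 is the best candidate
        rng.shuffle(ranks)
        best_seen = min(ranks[:cutoff]) if cutoff else n
        chosen = None
        for r in ranks[cutoff:]:
            if r < best_seen:          # better than everything observed so far
                chosen = r
                break
        hits += (chosen == 0)
    return hits / trials

rate = secretary_success_rate(100)     # ≈ 1/e ≈ 0.37
```

The k-best generalizations of this rule ([15], [30]) underlie the estimation strategy developed in this paper.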
In this paper we consider continuous search queries as a means
to search for information appearing in a specific Web area over a
period of time.
Assuming a query profile and a distance measure between profile
and documents, there are two basic strategies to process continuous search queries.
A first strategy is to adjust a quality threshold in order to extract relevant documents.
The second strategy is to estimate the best 'k' documents that appear
in the document stream.
In this paper we focus on the latter query method, which we denote
as bounded continuous search. A first advantage of bounded
queries is the simple query formulation, since no threshold (except the
maximal amount of information to be returned) has to be provided.
Second, there is no risk for a user of spending too much time reviewing the
documents or of overlooking important documents because of an information overflow.
On the other hand, bounded continuous search queries involve
a tradeoff between freshness and quality of the retrieved information.
In this paper we show that this freshness/quality tradeoff may lead to
suboptimal choices of documents if very fresh information is required.
We show in experiments that in this case the quality of retrieved information
may be improved significantly by applying optimal stopping theory.
Optimal stopping is a problem well-known in the field
of financial mathematics [34].
The results of this paper indicate that, considering charts of the relevance
of document versions, further instruments from the field of financial mathematics may be
applied to improve continuous search queries.
References

[1] ACM Table-of-Contents Alert. URL: http://portal.acm.org, 2006.
[2] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, pages 194-218, 1998.
[3] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In SIGIR '98: Proc. of the 21st annual international ACM SIGIR conf. on Research and development in information retrieval, pages 37-45, New York, NY, USA, 1998. ACM Press.
[4] A. Arampatzis and A. van Hameran. The score-distributional threshold optimization for adaptive binary classification tasks. In SIGIR '01: Proc. of the 24th annual international ACM SIGIR conf. on Research and development in information retrieval, pages 285-293, New York, NY, USA, 2001. ACM Press.
[5] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: semantic foundations and query execution. VLDB J., 15(2):121-142, 2006.
[6] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Rec., 30(3):109-120, 2001.
[7] K. Collins-Thompson, P. Ogilvie, Y. Zhang, and J. Callan. Information filtering, novelty detection, and named-page finding. In TREC 2002, Gaithersburg, 2002.
[8] Elsevier Contents Direct. URL: http://www.sciencedirect.com/, 2006.
[9] P. T. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec. The many faces of publish/subscribe. ACM Comput. Surv., 35(2):114-131, 2003.
[10] F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In SIGMOD '01: Proc. of the 2001 ACM SIGMOD international conf. on Management of data, pages 115-126, New York, NY, USA, 2001. ACM Press.
[11] D. Faensen, L. Faulstich, H. Schweppe, A. Hinze, and A. Steidinger. Hermes: a notification service for digital libraries. In JCDL '01: Proc. of the 1st ACM/IEEE-CS joint conf. on digital libraries, pages 373-380, New York, NY, USA, 2001. ACM Press.
[12] T. Ferguson. Who solved the secretary problem? Statistical Science, 4(3):282-289, 1988.
[13] P. Foltz and S. Dumais. Personalized information delivery: an analysis of information filtering methods. Commun. ACM, 35(12):51-60, 1992.
[14] P. Freeman. The secretary problem and its extensions: a review. Int. Statistical Review, 51:189-206, 1983.
[15] K. S. Glasser, R. Holzsager, and A. Barron. The d choice secretary problem. Comm. Statist. - Sequential Anal., 2(3):177-199, 1983.
[16] Google Alert. URL: http://www.googlealert.com, 2006.
[17] A. Hinze. Efficient filtering of composite events. In Proc. of the 20th British National Database Conf., 2003.
[18] D. Hull. The TREC-6 filtering track: Description and analysis. In The Sixth Text REtrieval Conf. (TREC-6), pages 45-68, Gaithersburg, 1997.
[19] R. Kadison. Strategies in the secretary problem. Expo. Math., 12(2):125-144, 1994.
[20] J. Kekäläinen and K. Järvelin. Using graded relevance assessments in IR evaluation. J. of the American Society for Information Science and Technology, 53(13), 2002.
[21] J. Kendall and K. Kendall. Information delivery systems: an exploration of web pull and push technologies. Commun. AIS, 1(4es):1-43, 1999.
[22] M. Koubarakis, T. Koutris, C. Tryfonopoulos, and P. Raftopoulou. Information alert in distributed digital libraries: The models, languages, and architecture of DIAS. In ECDL '02: Proc. of the 6th European Conf. on Research and Advanced Technology for Digital Libraries, pages 527-542, London, UK, 2002. Springer.
[23] L. Liu, C. Pu, and W. Tang. Continual queries for internet scale event-driven information delivery. Knowledge and Data Engineering, 11(4):610-628, 1999.
[24] L. Liu, C. Pu, and W. Tang. WebCQ - detecting and delivering information changes on the web. In CIKM '00: Proc. of the 9th int. conf. on Information and knowledge management, pages 512-519, New York, NY, USA, 2000. ACM Press.
[25] R.-L. Liu and W.-J. Lin. Adaptive sampling for thresholding in document filtering and classification. Inf. Process. Manage., 41(4):745-758, 2005.
[26] Windows Live Alerts. URL: http://alerts.live.com/Alerts/Default.aspx, 2006.
[27] G. Mühl. Large-Scale Content-Based Publish/Subscribe Systems. PhD thesis, Darmstadt University of Technology, 2002.
[28] S. Pandey, K. Ramamritham, and S. Chakrabarti. Monitoring the dynamic web to respond to continuous queries. In WWW '03: Proc. of the 12th international conf. on World Wide Web, pages 659-668. ACM Press, 2003.
[29] J. Pereira, F. Fabret, F. Llirbat, R. Preotiuc-Pietro, K. A. Ross, and D. Shasha. Publish/subscribe on the web at extreme speed. In VLDB, pages 627-630, 2000.
[30] J. Preater. On multiple choice secretary problems. Mathematics of Operations Research, 19(3):597-602, 1994.
[31] S. Robertson and I. Soboroff. The TREC 2002 filtering track report. In The Eleventh Text REtrieval Conf. (TREC 2002), 2002.
[32] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[33] U. Schreier, H. Pirahesh, R. Agrawal, and C. Mohan. Alert: An architecture for transforming a passive DBMS into an active DBMS. In 17th Int. Conf. on Very Large Data Bases (VLDB), pages 469-478, 1991.
[34] A. Shiryaev and G. Peskir. Optimal Stopping and Free-Boundary Problems (Lectures in Mathematics, ETH Zürich). Birkhäuser, 2006.
[35] Springer Link Alert. URL: http://www.springerlink.com, 2006.
[36] T. W. Yan and H. Garcia-Molina. The SIFT information dissemination system. ACM Transactions on Database Systems, 24(4):529-565, 1999.
[37] Y. Yang. A study on thresholding strategies for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. of SIGIR-01, 24th ACM International Conf. on Research and Development in Information Retrieval, pages 137-145, New Orleans, US, 2001. ACM Press.
[38] Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In SIGIR '98: Proc. of the 21st annual international ACM SIGIR conf. on Research and development in information retrieval, pages 28-36, New York, NY, USA, 1998. ACM Press.
[39] B. Yuwono, S. L. Y. Lam, J. H. Ying, and D. L. Lee. A World Wide Web resource discovery system. In Fourth International World Wide Web Conf., Boston, pages 145-158, 1995.
[40] C. Zhai, P. Jansen, E. Stoica, N. Grot, and D. A. Evans. Threshold calibration in CLARIT adaptive filtering. In Text REtrieval Conf. 1998, pages 96-103, 1998.
[41] Y. Zhang and J. Callan. Maximum likelihood estimation for filtering thresholds. In SIGIR '01: Proc. of the 24th annual international ACM SIGIR conf. on Research and development in information retrieval, pages 294-302, New York, NY, USA, 2001. ACM Press.
Footnotes

1. A bounded information load is very familiar to users with respect to other media like television or magazines. Newscasts and other transmissions on television typically take a well-defined amount of time.
2. The extension to a stream of document-sets, where each set consists of several (new) documents, is not the focus of this paper. In this case the information retrieval measures presented here have to be modified as described, e.g., in [38].
3. We consider the time needed to compute a ranking value negligible. Therefore the time a document is obtained corresponds to the respective x-axis values in the figure.
4. Obviously, by this method results are returned with a delay of 50% of the query execution time on average, assuming equally distributed ranking values.
5. If 'Delay = 1 week', obviously the optimal objects may be selected (at the end of the week). If, however, Delay < 1 week, usually only a suboptimal choice is possible.
6. If the ranking position of an object with respect to the previously observed objects is considered, the independence assumption states that these relative ranking positions are mutually independent.
7. We consider discrete times: if, e.g., a number of candidates have already been rejected or accepted, the current time is given by the number of candidates observed so far. A rank of '1' marks the best object.
8. We ignore the time to perform the relevance estimation. Objects that are rejected due to the learning period of the filter affect the quality but not the freshness value.
9. Considering PE(1), there is a slight decrease of the graded precision for increasing parameter values due to the respective definition.
10. On-line feedback may be simulated by successive revealing of training data.