Today's enterprise and web content management systems automate the integration of data from heterogeneous sources. For example, Yahoo! Marketplace verticals (e.g. Yahoo! Travel, Local and Shopping) aggregate structured as well as unstructured content from paid feed providers, user submissions, and web crawling. Human-provided data plays a crucial role in the effective operation of such automated systems. Content aggregation usually includes cleansing and enrichment applications such as attribute extraction, categorization, and entity resolution [4], which we will refer to as ACE.
ACE applications, regardless of whether they are machine learning-based or rule-based, all require human-labeled gold-standard data for bootstrapping (machine learning algorithm training, rule inferencing, etc.), quality-assurance monitoring, and feedback or refinement. Furthermore, human review is often critical for controlling data quality: for example, external data feeds may need to be manually reconciled for errors or missing information; user generated content may need to be moderated for spam and offensive content. There has been research on effectively leveraging human review resources, in particular in the area of active learning, which aims to maximize improvement in the performance of a machine learner using minimal human review resources. However, there has been little study of mechanisms to collect human-reviewed data at a large scale and of the behavior of the respondents providing the data. In industry, a common human data collection mechanism is the use of low-cost contract workers whose activities are coordinated by a locally-run service bureau. These providers generate human-labeled data for many of the aforementioned ACE applications. This review and labeling ecosystem offers a single convenient business interface to the labeling task, but has scalability limits in the incremental and ongoing relationship cost, as well as the latency and throughput of the integrated supply chain management system that governs it. As user-generated content proliferates on the web, websites are devoting increasing resources [2] to moderation and abuse detection. These highly-trained, typically in-house, editorial efforts often have high per-employee throughput. However, scalability is limited by high overhead costs such as recruiting, staff turnover and training.
In contrast, systems such as Google Image Labeler [9] and Amazon Mechanical Turk [1] attempt to scale to the Internet audience at large for the purpose of collecting human-reviewed data. Google Image Labeler, based on the ESP Game [20, 23], offers points with no monetary value as a reward for image tagging tasks and uses a head-to-head competition model to encourage good answers. Mechanical Turk provides monetary rewards (at least one cent per task), can be used for any generic task, and involves no explicit collaboration between answerers. Beyond these two systems, we can also view general question-answer venues, such as Yahoo! Answers [24], Usenet or discussion forums, as potential technology platforms for the collection of human-reviewed data. Collecting human-reviewed data at Internet scale has the potential to break the throughput bottleneck of in-house or outsourced providers while lowering the cost per unit of high-quality human review. For example, it is suggested [20] that an ESP Game system with 5000 active users at any given time would be able to label Google's image index in weeks. Two months after launch, Google Image Labeler [9] shows that the top five users have individually labeled over 8,000 images. Google does not incur any explicit per-unit cost in collecting this data, other than the overhead of operating the site. The ESP study [20] shows good results on several quality metrics. In another application, as part of the search for Jim Gray [18], 560,000 satellite images were reviewed by volunteers in 5 days. In this paper we conduct what we believe is the first public study of an Internet-scale human-reviewed data collection mechanism, focusing on data quality, task throughput, and user behavior. We also conduct experiments on Yahoo! Answers to study the feasibility of using general question-answering forums for human-reviewed data collection. We survey related work in Section 2. Section 3 presents an overview of the system we are studying as well as the tasks and datasets used in our experiments. Section 4 discusses the design and results of our experiments. We repeat some of our experiments on Yahoo! Answers in Section 5. We conclude in Section 6 with substantial evidence that high-quality human-reviewed data can be acquired at low cost at Internet scale.
As an initial experiment, we posted a qualification test on System M for each of the four applications: hotel resolution, age extraction, brand extraction and model extraction. The qualification tests are taken voluntarily by workers and are unpaid. We included 20 questions in each qualification test, except the age test, which had 21.
In the hotel resolution test, the worker was given 20 pairs of hotel business records (with name, location, and phone information) and asked to choose among five choices for each pair: same business, different names but same location, same business but different location, two completely unrelated businesses, or other. For age, brand or model extraction, the worker was given 20 (brand and model) or 21 (age) product description text strings and asked about each. For age extraction, the worker was asked to select from three categories: adult, kids or not applicable. For the brand and model extraction tests, the worker was asked to type in the product brand or model, respectively. Note that the question format for the general experiments in Section 4.2 was the same as that of the qualification test, except that the qualification test had 20 or 21 questions, whereas each general experiment task consisted of a single question.
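To make the task structure concrete, the following Python sketch shows one hypothetical way to represent these question formats; the Question class and its field names are our own illustrative assumptions, not System M's actual schema.

\begin{verbatim}
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical representation of the question formats described above;
# field names are illustrative, not System M's actual schema.
@dataclass
class Question:
    task_type: str                       # "hotel_resolution", "age", "brand", or "model"
    prompt: str                          # record pair or product description text
    choices: Optional[List[str]] = None  # present only for multiple-choice questions
    gold_answer: Optional[str] = None    # known label, used to score qualification tests

HOTEL_CHOICES = [
    "same business",
    "different names but same location",
    "same business but different location",
    "two completely unrelated businesses",
    "other",
]
AGE_CHOICES = ["adult", "kids", "not applicable"]

# A multiple-choice age question and a free-text brand question on the same product.
q1 = Question("age", "Lakai Men's Soca 2 Shoe", choices=AGE_CHOICES, gold_answer="adult")
q2 = Question("brand", "Lakai Men's Soca 2 Shoe", gold_answer="Lakai")
\end{verbatim}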
Table 1 shows the number of participating workers and accuracy results on the four qualification tests. There was significantly more participation in the multiple-choice task types (hotel resolution and age extraction) than in the free-text task types. Figure 1 delineates the distribution of accuracy scores among the workers who took each qualification test. Table 1 shows that, on average, workers performed best on hotel resolution, followed by brand extraction. Accuracy was significantly lower on the age and model extraction qualification tests. Figure 2 shows a horizontal timeline for submission of qualification tests by workers. The starting point of the graph is midnight October 31, 2006, Eastern Time. Each vertical gridline is midnight Eastern Time. The four qualification tests were posted on System M at 5:13PM Eastern Time on October 31. Note that the qualification tests were deactivated on System M after the general experiments were done. The time series for the hotel resolution qualification test ends the earliest because the hotel resolution experiments were completed first.
For each of the four applications, we conducted at least one experiment with a qualification restriction and at least one experiment without any qualification restriction, in order to determine whether there is an accuracy difference between the presence and absence of qualification restrictions, and between higher and lower qualification cutoffs. For the hotel resolution problem, we conducted a separate experiment that collected five answers per task, to contrast with the collection of three answers per task, in order to see whether collecting more answers per task leads to higher accuracy of the voted answer.
Figures 3, 4, 5, and 6 show discretized distributions of worker accuracy (the aggregate accuracy of all the answers submitted by a worker on a given task type experiment), with one graph per application. The worker accuracy scores are discretized into intervals of width 0.05 ([0,0.05), [0.05,0.1), ..., [0.95,1), [1,1]). The y-value (fraction of workers with accuracy in a given interval) is plotted against the lower bound of the interval. Figure 7 plots worker accuracy versus the worker's accuracy score on the corresponding qualification test. Multiple experiments for the same application are combined into a single data series.
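As a concrete illustration of this discretization, the sketch below bins per-worker accuracy scores into 0.05-wide intervals and reports the fraction of workers in each bin; only the binning rule (half-open intervals, with a perfect score of 1 in its own bin) comes from the text, while the function itself is our own.

\begin{verbatim}
from collections import Counter
from typing import Dict, List

def accuracy_histogram(worker_accuracies: List[float],
                       width: float = 0.05) -> Dict[float, float]:
    """Fraction of workers whose accuracy falls in each [k*width, (k+1)*width) bin.

    A perfect score of 1.0 goes into its own bin, matching the [1,1] interval
    described above. Function name and structure are illustrative.
    """
    counts = Counter()
    for acc in worker_accuracies:
        # small epsilon guards against floating-point boundary noise
        lower = 1.0 if acc >= 1.0 else int(acc / width + 1e-9) * width
        counts[round(lower, 2)] += 1
    n = len(worker_accuracies)
    return {lower: counts[lower] / n for lower in sorted(counts)}

# y-value (fraction of workers) keyed by the lower bound of each interval.
print(accuracy_histogram([0.62, 0.97, 1.0, 0.78, 0.97]))
\end{verbatim}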
Table 2 shows that, for each application, as we lower or remove the qualification requirement, the number of participants increases and the elapsed time to complete all the tasks decreases. In all experiments, the majority threshold voting scheme (2/3 or 3/5) resulted in higher accuracy than the average accuracy of the underlying answers, demonstrating the wisdom of the crowd. The two hotel resolution experiments without qualification but with different voting schemes showed minimal difference in accuracy, suggesting that accuracy is boosted as long as there is some kind of voting. In terms of the correlation between answer accuracy and qualification requirement, we had expected that requiring a higher qualification accuracy would lead to higher answer accuracy than requiring a lower qualification accuracy, and that having a qualification requirement would lead to higher answer accuracy than not having one. The hotel resolution, brand and model extraction experiments validated this hypothesis. The only outlier was the age extraction experiment without a qualification requirement, which showed much higher answer accuracy than the two age experiments with qualification requirements. We believe this is the result of having prolific participating workers who are by chance significantly above average in accuracy. Figure 4 shows that most of the distribution of workers for the age extraction experiment without qualification is concentrated in the very high accuracy region between 0.9 and 1. It turns out that these high-accuracy workers were also very prolific, thereby biasing the average answer accuracy upward. In fact, the most prolific 20% (4 out of 22) of workers provided 69% of all the answers at 97.1% accuracy. With only tens of participating workers in most of the experiments, a few prolific workers with above- or below-average accuracy could bias the average accuracy of an experiment. We believe our hypothesis does hold, though larger scale experiments are needed. Overall, the accuracy achieved with qualification requirements on the age, brand and model extraction was very encouraging for practical applications. The accuracy on the hotel resolution experiments was lower than expected.
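The majority threshold voting scheme can be illustrated with a short sketch; the thresholds (2 of 3 answers, or 3 of 5) come from the experiments above, while the function itself is a hedged reconstruction rather than the system's actual code.

\begin{verbatim}
from collections import Counter
from typing import List, Optional

def vote(answers: List[str], threshold: int) -> Optional[str]:
    """Return the answer given by at least `threshold` of the collected answers,
    or None if no answer reaches the threshold (the task is then left unresolved).

    With three answers per task the threshold is 2; with five answers, it is 3.
    """
    if not answers:
        return None
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer if top_count >= threshold else None

# Example: three workers answer one hotel resolution task.
print(vote(["same business", "same business", "other"], threshold=2))
\end{verbatim}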
The worker accuracy distributions shown in Figures 3, 4, 5, and 6 look very different from the worker qualification accuracy distributions in Figure 1. Brand extraction is the only similarity: both its qualification and experiment accuracy distributions have most of their mass around an accuracy of 0.7 to 0.8. Compared to the corresponding qualification accuracy distributions, the hotel resolution experiment accuracy distributions are shifted toward lower accuracy, while the age and model experiment accuracy distributions are shifted toward higher accuracy. Figure 7 indicates that, in general, there seems to be little correlation between qualification and experiment accuracy. For hotel resolution, age and model extraction, workers with similar qualification accuracy show experiment accuracy varying from 0 to 0.8. The brand extraction data series suggests a correlation, with its main cluster in the worker accuracy range of 0.8 to 1, though there are three outliers with low worker accuracy.
We analyzed the timing of worker submissions to compute hourly pay rates for the workers. Each answer submission comes with an accept timestamp (when the worker accepted the task to work on it) and a submit timestamp (when the worker completed the task and submitted the answer). Given all the answer submissions from a particular worker on a given task type, we derive the time spent by the worker as the latest submit timestamp minus the earliest accept timestamp. The hourly pay rate is computed as the number of answers approved by the threshold voting scheme, multiplied by one cent (the reward per task), divided by the time spent. Note that this computation assumes the worker is not working on other task types (e.g. from other requesters) concurrently. Table 3 shows, for each task type, the average worker hourly pay rates and the two highest individual hourly pay rates along with the corresponding worker answer accuracy. For each task type, the average hourly pay rate for a group of workers is computed as the aggregate reward paid divided by the aggregate time spent. For each task type, we compare the average hourly pay rate and answer accuracy for all workers, for prolific workers (defined as those who answered at least 5% of the 300 tasks), and for the two prolific workers with the highest hourly pay rate. For each of the four applications (hotel resolution and age/brand/model extraction), the average pay rate of all workers increases as the qualification becomes less restrictive. As discussed in Section 4.2.2, and as can be seen in the Answer Accuracy column for All Participants in Table 3, overall accuracy tends to decline as the qualification becomes less restrictive. A possible explanation of the pay trend is that as the qualification becomes less restrictive, we see more participants who are not as serious about work quality and are answering the tasks quickly to make money. Comparing the pay rate of the prolific workers against all workers across the 11 task types, in seven cases the prolific workers are paid more and in the rest the pay is the same. Comparing answer accuracy, in seven cases the prolific workers are more accurate, and in the rest the prolific workers are just as accurate or marginally less accurate. The rightmost three columns of Table 3 show that the highest paid workers earn significantly more than the average pay rate. They also tend to be prolific (answering 41% of the tasks on average) and accurate (only 4 of 22 have lower than average answer accuracy). The top paid prolific worker across all the experiments was someone on the age extraction without qualification task type who averaged $6.53 per hour. This worker answered 236 age extraction tasks within 21 minutes, averaging 5.4 seconds per task, at an above-average 97% accuracy. Overall, we see that the vast majority of the workers participating in the experiments earned significantly below minimum wage rates, while a very few prolific workers on a subset of the task types approached the minimum wage rate.
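The pay-rate computation described above can be written out as a short sketch; the one-cent reward per task is from the experiments, while the timestamp handling and function name are illustrative assumptions.

\begin{verbatim}
from datetime import datetime
from typing import List, Tuple

REWARD_PER_TASK = 0.01  # one cent per approved answer, as in the experiments

def hourly_pay_rate(submissions: List[Tuple[datetime, datetime]],
                    approved: int) -> float:
    """Hourly pay rate for one worker on one task type.

    `submissions` holds (accept, submit) timestamp pairs for the worker's answers;
    time spent is the latest submit timestamp minus the earliest accept timestamp.
    Assumes the worker did not work on other task types concurrently.
    """
    start = min(accept for accept, _ in submissions)
    end = max(submit for _, submit in submissions)
    hours = (end - start).total_seconds() / 3600.0
    return approved * REWARD_PER_TASK / hours

# 236 approved answers in about 21 minutes works out to roughly $6-7 per hour,
# in line with the top rate reported above.
subs = [(datetime(2006, 11, 1, 12, 0), datetime(2006, 11, 1, 12, 21))]
print(round(hourly_pay_rate(subs, approved=236), 2))
\end{verbatim}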
Having seen encouraging results from the user community of System M, can we leverage large existing online social network sites for the collection of human-reviewed data? The first candidate that comes to mind is Yahoo! Answers. Yahoo! Answers is a general question-answer discussion forum organized by a topic hierarchy. In the 11 months since launch, it has accumulated 65 million answers [24]. It is similar to Google Image Labeler in that there is a base of dedicated users (two top users have each answered over 50,000 questions) and the system does not provide monetary rewards to users. Two questions come to mind: Can we engage the Answers user community to participate in our application? Can the underlying Answers technology platform be leveraged for our application? To answer these questions, over a period of four days, we manually submitted the 81 questions from the four qualification tests of Section 4.1 to Yahoo! Answers. Each question was available to be answered by users for three days.
We received an encouraging number of answers: 95% of questions had at least one answer. We manually reviewed each answer to label whether it was spam, whether it was useful and answered the question, whether it was machine parsable, and whether it was correct. Table 5 shows the results of the experiments in detail. The last two rows show the accuracy if we use majority voting on the useful (resp. parsable) answers to each question. A few data points are not in the table: about 1% of the answers were spam, 4% were unhelpful responses such as ``who cares?'', and about 13% of the answers did not directly answer the question (e.g. one answered ``It's an Air Conditioner'' for a brand extraction question). Of interest here are the 8% of answers that contained extra useful information beyond what was asked in the question. For example, on one hotel resolution question, a user actually called the phone number in the question data to determine that the two hotel records presented were the result of a franchise change at the location. On hotel resolution questions, 45% of answers were correct. On extraction questions, age had a 28% answer accuracy, brand had a 51% answer accuracy and model had a 17% answer accuracy.
The biggest challenge with using a discussion forum like Yahoo! Answers for automated collection of human-reviewed data is parsing. The challenge is twofold: first, separating the useful data from the spam and unhelpful data; second, parsing the user's intended response out of pleasantries and grammatical ``glue.'' The challenge is largely a consequence of user behavior. Forums like Yahoo! Answers are meant for human exchanges, so users are accustomed to receiving conversational questions and responding with breezy, off-the-cuff answers. One imagines that this sort of natural-language give-and-take gives users a degree of confidence in the questioner and, respectively, the responder that can only be assessed by a living person. For example, our hotel resolution questions look like the following:
... What do you think is the relationship between the two businesses described by the two records? Is it: A. The two records are about the same business. B. The two records have different names but are at the same location. ...
Of course, it is trivial to parse the verbatim answers, which were the result of cut-and-paste in the browser. The difficulty often arises when the user answers are fluent conversational responses, such as: ``Since you didn't give your sources, I am inclined to answer with 'F' '' or ``My guess is C because they are both hotels.'' On the free-text questions such as brand or model extraction, simple regular expression templates could potentially handle terse responses such as ``The brand is Panasonic'' or ``It's made by Pickett.'' One cannot expect to exhaustively list all such possible text patterns, so this approach has clear limitations. In some cases, we needed to segment the answer into sentences to filter out irrelevant statements; for example, ``It's a BOSCH. We have a BOSCH and it works great!'' In some cases the user provided multiple answers, for example, ``I'd say either Sanford or Prismacolor??''. The double question marks indicate responder uncertainty, which complicates the answer recovery problem.
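A minimal sketch of the regular-expression template approach mentioned above, assuming a small hand-built pattern list; the patterns are illustrative and, as noted, can never be exhaustive.

\begin{verbatim}
import re
from typing import Optional

# A few illustrative templates for terse free-text brand answers.
BRAND_PATTERNS = [
    re.compile(r"the brand is\s+(?P<brand>[\w&'-]+)", re.IGNORECASE),
    re.compile(r"it'?s made by\s+(?P<brand>[\w&'-]+)", re.IGNORECASE),
]

def extract_brand(answer: str) -> Optional[str]:
    """Return the first brand name matched by any template, or None."""
    for pattern in BRAND_PATTERNS:
        match = pattern.search(answer)
        if match:
            return match.group("brand")
    return None

print(extract_brand("The brand is Panasonic"))  # -> Panasonic
print(extract_brand("It's made by Pickett."))   # -> Pickett
print(extract_brand("who cares?"))              # -> None
\end{verbatim}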
In contrast, on System M, multiple-choice questions did not present any ambiguity in discerning user intent, as the user chose among radio buttons in the browser GUI. For free-text questions on brand and model extraction, none of the System M workers entered extraneous text. Clearly, user behavior is very different on System M, as workers are doing a task for the requester in an explicit paid relationship, rather than having a potentially open-ended pro bono question-and-answer conversation with a fellow user. The System M workers expect to be evaluated on their answers and there is explicit monetary payment associated with this evaluation; in contrast, the question-and-answer system is informal and answerers accrue irredeemable, non-monetary ``points.'' On account of its large user base alone, Yahoo! Answers is a promising vehicle for automated collection of human-reviewed data. We saw decent participation from users. On two of the task types (hotel resolution and brand extraction), we saw reasonable answer accuracy of over 45%. The challenge of parsing user answers can be mitigated in several ways with small changes to the underlying technology infrastructure: support multiple-choice questions, which can be modeled as polls; as part of the question text, explicitly state that the answer collection is automated and ask users not to type in extraneous text; and implement clever user interfaces (e.g. for brand/model extraction, require the user to select a substring from the product description text). Applications of general question-answer forums such as Yahoo! Answers, as well as user behavior in such venues, deserve significant research attention.
In this study, we conducted experiments analyzing the data quality, throughput and user behavior of an Internet-scale system for the collection of human-reviewed data. The tasks we experimented with were real content aggregation applications using real-world data. The main contributions of our work are the detailed study using real datasets and the thorough analysis of the resulting data quality and user behavior. Our results show that by applying worker pre-qualification mechanisms, we are able to obtain 82% accuracy on hotel resolution, 95% accuracy on product age category extraction, 85% accuracy on product brand extraction and 80% accuracy on product model extraction. These quality measures are very encouraging for a wide variety of practical ACE applications, from creating labeled training sets for machine learning algorithms to providing labeled datasets for quality assurance monitoring. We extend the discussion of applications of human-reviewed data in Section 6.1. We envision that future enterprise and web information integration and content aggregation systems will include wrappers to interface with Internet-based human-reviewed data collection systems, such that the data processing system can push human review requests to the data collection system on demand.
In terms of future work, we would like to investigate more human-reviewed data collection systems and incentive schemes, as well as conduct larger scale experiments with data from more applications that consume human-labeled data. Are some types of tasks more suitable than others for large scale human review? On unstructured systems such as Yahoo! Answers, we would like to study techniques to parse responses. We would also like to study richer interfaces for general users as a solution to the parsing challenge. For example, can the head-to-head collaborative paradigm of the ESP Game be applied to other types of tasks, such as entity resolution or attribute extraction? Another emerging application of interest is feedback and suggestion systems such as the Yahoo! Suggestion Board [25]. Like Yahoo! Answers, it poses challenges in interface design and in algorithmic solutions for automatically filtering out noise and parsing responses.
As discussed in Section 1, human-labeled data is very important for many ACE applications, since humans are the only authoritative source for label data. However, having been sourced from fallible humans, the label data itself is imperfect; a given human label could be incorrect relative to the universal truth (as opposed to the ``labeled'' truth) for a variety of reasons. Complicating the picture, so-called ``gold standard'' datasets often have only a single data point per label, for reasons of efficiency. Furthermore, some labels are inherently ambiguous and subject to interpretation, or the available context may lack the information needed for accurate labeling. For example, in our experiments on System M, given the product description text ``Lakai Men's Soca 2 Shoe'', two workers answered that the model is ``Soca 2'', while one worker answered ``Soca''. Ignoring the gold standard label, in this case it is difficult to determine which is correct given just the product description. The multiple human data inputs merely provide good candidates for valid model names rather than a definitive answer. For the product description text ``adidas Piccolo IV Infants & Toddlers'', the so-called gold standard model label is ``adidas Piccolo IV Infants & Toddlers'', which is clearly incorrect since ``adidas'' is a brand name. In contrast, the model label voted by the workers was ``Piccolo IV'', which seems correct. In this case, the collected external human-labeled data can serve to correct our internal human-labeled gold standard dataset. Humans can be used in a feedback loop to validate previous generations of human-reviewed data, resulting in enhanced data quality and reliability. For instance, on System M there are survey-style tasks asking workers to list their top 3 travel destinations. The same requester has a separate set of tasks to validate those answers, asking workers ``are x, y, and z valid travel destinations?''
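The feedback loop described here could be implemented as a simple reconciliation step that flags gold-standard labels contradicted by a worker-voted consensus; the function and threshold below are illustrative assumptions rather than part of either system.

\begin{verbatim}
from collections import Counter
from typing import Dict, List

def flag_suspect_gold_labels(gold: Dict[str, str],
                             worker_answers: Dict[str, List[str]],
                             min_votes: int = 2) -> Dict[str, str]:
    """Return {item: voted_label} for items where a worker consensus of at least
    `min_votes` agreeing answers disagrees with the gold label; these items become
    candidates for re-review. Threshold and structure are illustrative.
    """
    suspects = {}
    for item, answers in worker_answers.items():
        voted, count = Counter(answers).most_common(1)[0]
        if count >= min_votes and voted != gold.get(item):
            suspects[item] = voted
    return suspects

gold = {"adidas Piccolo IV Infants & Toddlers": "adidas Piccolo IV Infants & Toddlers"}
votes = {"adidas Piccolo IV Infants & Toddlers": ["Piccolo IV", "Piccolo IV", "Piccolo"]}
print(flag_suspect_gold_labels(gold, votes))  # flags the gold model label for re-review
\end{verbatim}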