TaskMate: A Mechanism to Improve the Quality of Instructions in Crowdsourcing

Dwarakanath Jampani, V. K. Chaithanya Manam, Mariam Zaim

ECE 695 - Spring 2017 - Purdue University

Demo Video

Introduction

Crowdsourcing platforms have enabled many requesters to accomplish tasks that require human computation at low cost and with satisfactory accuracy. According to reviews on Turkopticon and prior research, there have been several instances where a worker was unable to complete a required task, or submitted incorrect results, because the instructions were ambiguous or sloppy. Gadiraju et al. found that task instructions on crowdsourcing platforms are often vague, unclear, inconsistent, ambiguous, and imprecise. This squanders the time and effort of both the requester and the worker.

The problem arises from the following scenario: as a requester translates the idea in their mind into instructions for a task, they may fail to phrase the task clearly, whether from lack of time or lack of skill. To the requester, the instructions seem intuitive and straightforward, because they already have a clear picture of the expected solution, and they expect workers to understand the task from their point of view. Workers, however, may find the task difficult to understand because they lack the requester's prior knowledge or come from a different cultural or educational background. Even a straightforward task, if worded poorly, can seem difficult.

In this paper, we propose a novel mechanism called TaskMate that leverages crowd workers to improve the quality of instructions for a given task. TaskMate allows workers to collaborate with each other to solve a task, thereby minimizing the effort and time the requester must invest. To achieve this goal, we propose a workflow that divides the overall problem into small, manageable, and verifiable steps.

In this paper we present the framework, user interface, and applications of our mechanism. With this work, we hope to answer the following research questions: How well can workers produce high-quality instructions? How accurately can they infer the intent of the requester?

Related Work

Researchers in crowdsourcing have built innovative applications for specific use cases that hide the complexity of task specification. For instance, Bernstein et al. proposed Adrenaline, a real-time camera shutter that uses the crowd to select the best photographic moment from a short video. Soylent, a crowd-powered word processing tool, can edit, shorten, and proofread documents with the help of crowd labor; by packaging Soylent as a plug-in for Microsoft Word, its creators hid the complexity of task specification behind a simple interface. Turkomatic, CrowdForge, and Crowd4U propose mechanisms to decompose a complex task posted in natural language into smaller tasks suitable for crowdsourcing platforms. These applications are successful; however, their approaches do not generalize to all task specifications in crowdsourcing.

Gutheim et al. proposed Fantasktic, a system that helps novice requesters create successful tasks and receive high-quality responses from the crowd. Fantasktic introduced three task design techniques: a guided task specification interface, a preview interface, and a worker tutorial. The guided task specification interface provides guidelines and recommendations to the requester while creating a task; the preview interface presents the task to the requester from the worker's perspective; and the worker tutorial is generated automatically from sample answers provided by the requester. Their results show a significant improvement in the quality of worker responses for instructions created with the guided task interface, while the task preview and worker tutorials had no measurable impact. Fantasktic is a proactive task design technique that helps novice requesters create a better task. TaskMate, in contrast, is a reactive task design technique that can be used by both novice and expert requesters, and it enables worker-requester interactions that Fantasktic does not provide.

To improve the quality of work in crowdsourcing, the ESP Game had multiple workers work simultaneously and aggregated their responses by voting. Le et al. used qualifying tests, called gold standards, to preselect qualified workers. Peer-review workflows allow workers to rate a sample answer, and iterative workflows allow workers to collaborate and improve on previous workers' answers. Sampath et al. proposed cognitively inspired task designs that make it inefficient for workers to provide incorrect responses. Basu and Christensen proposed methods to teach and educate the crowd, and Bragg et al. showed that an adaptive teaching strategy for the crowd is more effective than fixed-length teaching.

TaskMate

We propose TaskMate, a system that accepts task instructions which may contain ambiguities, processes them with the help of crowd workers, and finally produces a modified version of the original instructions in which the ambiguities have been resolved. The TaskMate framework consists of five stages: 1) Identify, 2) Resolve, 3) Merge, 4) Verify, and 5) Select. The flowchart in Figure 1.1 gives a visual overview of this process. Each stage of TaskMate is described in further detail below.
TaskMate flowchart
Figure 1.1 - TaskMate framework
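
To make the data flow concrete, the following minimal sketch shows one way the state passing through the five stages could be represented. It is illustrative only; the class and field names are our own assumptions, not a prescribed implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Ambiguity:
        description: str                      # the unclear part of the instructions
        candidate_clarifications: List[str]   # proposed in the Identify stage
        chosen_clarification: str = ""        # filled in by the Resolve stage

    @dataclass
    class TaskState:
        original_instructions: str
        ambiguities: List[Ambiguity] = field(default_factory=list)
        merged_drafts: List[str] = field(default_factory=list)    # produced by Merge
        verified_drafts: List[str] = field(default_factory=list)  # survived Verify
        final_instructions: str = ""                              # winner of Select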

Step 1: Identify
In this first step, the goal is to identify any ambiguities in the task instructions. To achieve this, we employ three workers, who are asked to solve the original task. Our intention is that, as they think through their solutions, they will notice ambiguities or unclear wording in the task instructions. For each ambiguity a worker identifies, we ask them to list possible clarifications. An example of the UI that workers see in the Identify stage is shown in Figure 2.1.

TaskMate Identify stage UI
Figure 2.1 - TaskMate Stage 1: Identify

Step 2: Resolve
In this step we employ three workers to resolve the ambiguities identified in the Identify step. For each ambiguity, they pick the best answer from the list of clarifications provided by the Identify workers. The data from all three workers is collected, and the answer that receives the highest number of votes for each ambiguity is carried into the next step. An example of the UI that workers see in this stage is shown in Figure 2.2.

TaskMate Resolve stage UI
Figure 2.2 - TaskMate Stage 2: Resolve
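
The vote aggregation in this stage amounts to a plurality count per ambiguity. A minimal sketch follows, with hypothetical names and example data (our implementation details may differ):

    from collections import Counter

    def plurality_winner(votes):
        """Return the answer with the most votes; ties are broken arbitrarily.

        votes: one answer string per Resolve worker (three in our setup).
        """
        winner, _ = Counter(votes).most_common(1)[0]
        return winner

    # Example: three workers vote on the clarifications for one ambiguity.
    print(plurality_winner(["national suppliers only", "national suppliers only",
                            "any supplier"]))  # -> "national suppliers only"

The Select stage (Step 5) reuses the same plurality logic, applied to whole task instructions rather than to individual clarifications.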

Step 3: Merge
At this step, we have the list of ambiguities along with the corresponding solutions. Given this information, workers are asked to edit the original task instructions to incorporate all of the answers, so that the question becomes more elaborate and explanatory. The UI for the Merge stage is displayed in Figure 2.3.

TaskMate Merge stage UI
Figure 2.3 - TaskMate Stage 3: Merge

Step 4: Verify
This step employs workers to analyze each of the edited instructions produced in the previous step. For each edited instruction, workers must judge whether the edits are sufficient to answer the ambiguities raised in Identify, and whether all answers from Resolve have been incorporated without changing the meaning of the original question. The workers' responses are collected, and only the instructions that receive exclusively positive votes proceed to the final step. This stage acts as a quality control mechanism that discards low-quality solutions. The UI for this stage is shown in Figure 2.4.

TaskMate Verify stage UI
Figure 2.4 - TaskMate Stage 4: Verify
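
Because a draft advances only when it receives exclusively positive votes, this quality gate reduces to a unanimity check. A small illustrative sketch (the names and data are hypothetical):

    def passes_verify(votes):
        """A merged draft advances only if every Verify worker approves it."""
        return all(votes)

    # Example: each draft is paired with three yes/no votes.
    drafts = [("draft 1", [True, True, True]),
              ("draft 2", [True, False, True])]
    verified = [d for d, votes in drafts if passes_verify(votes)]
    print(verified)  # -> ['draft 1']

Requiring unanimity rather than a simple majority makes the gate stricter, which matches Verify's role as a quality control mechanism.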

Step 5: Select
This is the final stage of TaskMate. From the list of verified task instructions, workers vote to select the one that best captures all missing information and clarifies any ambiguities. This step produces a single high-quality task instruction, which is then presented to the requester. The UI example is shown below in Figure 2.5.

TaskMate Select stage UI
Figure 2.5 - TaskMate Stage 5: Select

System Usage

Our system is intended for requesters who aim to post a task on a crowdsourcing platform. With TaskMate, we hope to decrease the number of iterations that a requester has to go through in order to generate a clear set of instructions.

To use our system, the user first enters a task using the UI shown in Figure 3.1. Once the task is submitted, the first stage of TaskMate is triggered; from then on, the remaining stages are posted to a crowdsourcing platform automatically. The user can monitor the real-time progress of each stage via the UI shown in Figure 3.2. Once all five stages of TaskMate are complete, the 'new and improved' task is reported back to the user in the bottom gray bar of Figure 3.2.

TaskMate Requester page
Figure 3.1 - Requester page for posting initial task.
TaskMate Progress Page
Figure 3.2 - Task status page for the user to monitor the progress of each stage in TaskMate.
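
For readers curious how a stage can be posted programmatically, the sketch below shows one plausible way to create a HIT through the Amazon Mechanical Turk API via boto3. The endpoint shown is the MTurk sandbox, and the URL, reward, and timing values are illustrative assumptions, not the parameters our system actually uses.

    import boto3

    # Hypothetical sketch: post one TaskMate stage as an MTurk HIT.
    mturk = boto3.client(
        "mturk",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )

    # The stage UI is served from our own (hypothetical) server as an ExternalQuestion.
    external_question = """
    <ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.com/taskmate/identify</ExternalURL>
      <FrameHeight>600</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title="Identify unclear parts of a task's instructions",
        Description="Read a short task and list anything ambiguous about it.",
        Reward="0.25",                       # illustrative payment
        MaxAssignments=3,                    # three workers per stage
        LifetimeInSeconds=24 * 60 * 60,
        AssignmentDurationInSeconds=15 * 60,
        Question=external_question,
    )
    print(hit["HIT"]["HITId"])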

Evaluation Criteria

We intend to evaluate TaskMate based on the following two criteria:

Evaluation of Feasibility

To evaluate the feasibility of our mechanism, we verify whether the final set of instructions produced by TaskMate correctly resolves all ambiguities identified in the first step of our framework. This can be quantified as the percentage of identified ambiguities that were resolved.
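
The metric itself is straightforward; for example, using the figures from Experiment 1 reported later in Table 4.2:

    def percent_resolved(identified, resolved):
        """Feasibility metric: share of identified ambiguities that the
        final instructions resolve."""
        return 100.0 * resolved / identified

    print(f"{percent_resolved(6, 5):.0f}%")  # Experiment 1: 5/6 -> 83%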

Evaluation of Accuracy

By comparing the instructions developed by the crowd workers against ground-truth data, we can quantify the accuracy of TaskMate's results. This involves checking two conditions:

  1. The resulting instructions from TaskMate capture all the information provided in the ground truth.
  2. The instructions from TaskMate do not contain any statements contradictory to the ground truth.

Qualifications of Participants

Crowd Workers: We impose no restrictions on the educational, cultural, or language background of the participants we hire in TaskMate. This follows from our original aim of studying how untrained workers can collaborate to improve a set of instructions. We recruit workers from Amazon Mechanical Turk who have previously completed at least 100 HITs and have a minimum 98% HIT acceptance rate. With these qualifications, we can rely on a group of well-performing workers with enough experience on AMT to predict a requester's intentions when instructions are ambiguous.

Experts: To develop ground-truth datasets, we (the three researchers) use our own judgment to develop the gold instructions for each task. Each of us writes a high-quality set of instructions for the given task, and we use majority vote to consolidate them into a single set of instructions that best describes the task.

Study Design

We tested our system by posting five ambiguous, unclear tasks on Amazon Mechanical Turk. Each task went through the five stages of TaskMate and produced a clear and precise instruction at the end. At each stage, three unique workers were hired, and each worker was allowed to work on only a single stage of a given task. A total of 75 workers participated in our study. Payments made to workers for each stage were as follows:

Output Accuracy

As mentioned in the previous section, five different tasks were tested on Amazon Mechanical Turk. For each task, the workers succeeded in identifying a list of ambiguities, resolving them, and generating a clearer set of task instructions. The results obtained are presented below.

Experiment 1
  Original: "List URLs where one can buy desktop computers for a classroom."
  Improved: "Please provide a list of URLs where a college computer lab administrator could buy new desktop computers in bulk. I need these addresses to be national, not from suppliers overseas. The computers must be capable of design, 3D modeling, and video editing."

Experiment 2
  Original: "List the top 3 Computer Science departments."
  Improved: "List the top 3 undergraduate Computer Science Departments at colleges or universities in the U.S. I am specifically interested in those focused on computer programming."

Experiment 3
  Original: "Write down the recipe for baking cookies?"
  Improved: "Write down my mother's recipe for baking cookies and email it to me."

Experiment 4
  Original: "Search for the 3 *most popular* movies that are appropriate for kids. Write down the title of the movie, and the names of two actors."
  Improved: "Write down the movie title and two actors of the top three animated movies this year in popularity that are appropriate for 5-10 year olds."

Experiment 5
  Original: "Please provide the contact information for a bakery in Chicago near the airport."
  Improved: "Please provide the contact information for a bakery within five miles of Chicago's O'Hare International Airport."
Table 4.1 - Results of TaskMate obtained from five different experiments.

Output Feasibility

To perform the feasibility analysis, we compare the number of ambiguities resolved in the final version of the task against the number of ambiguities identified in the first stage of TaskMate. The data in Table 4.2 suggests that, in most cases, workers successfully resolved all but one of the ambiguities and incorporated the resolutions to produce an improved task.

This not only underscores the success of TaskMate, but also shows how a group of workers can work iteratively, building on each other's answers to produce an improved task instruction.

Experiment   Ambiguities Resolved
1            5/6 = 83%
2            5/6 = 83%
3            2/3 = 67%
4            4/6 = 67%
5            2/3 = 67%
Table 4.2 - Percentage of ambiguities that were resolved in each experiment, when generating the improved version of the task.

Runtime Analysis

For five tasks, the average results per task were as follows:

Furthermore, data was collected to evaluate the average time workers spent in each stage of TaskMate; Figure 4.1 displays this information as a bar chart. As expected, the Resolve, Verify, and Select stages finished quickly, taking only a few seconds each, since they involve only selecting a few radio buttons or dropdown menus. The first stage, Identify, requires workers to list as many ambiguities as they can think of, and is therefore where most of the time is spent. The Merge stage, in which workers write a modified version of the original task, also took a couple of minutes on average.

TaskMate average runtime bar chart
Figure 4.1 - Average time (in minutes) spent by a worker in a particular stage of TaskMate.

Discussion & Future Work

This section reviews some fundamental questions about the nature of paid, crowd-powered interfaces as embodied in TaskMate. Our work suggests that workers are able to guess the intent of the requester. In TaskMate, the average task completion time is on the order of minutes; this is affected by worker demographics, worker availability, the relative attractiveness of the work, and the amount we pay. To reduce the completion time, we could either increase the payment per task or use the retainer model suggested in "Crowds in Two Seconds". In the current system, workers must identify all the ambiguities in the first stage. In the future, we could extend TaskMate into a cyclic process by asking workers at each stage whether any new ambiguities have been found; if so, we would move back to the first stage and continue until all ambiguities are resolved.

We have not used any Natural Language Processing (NLP) techniques to validate the task provided by the requester. For future work, we could use NLP to identify some of the ambiguities up front, for example via word sense disambiguation. Moreover, we could automatically correct grammatical and syntactic errors in the original instructions.
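
As a sketch of this future direction (not an implemented component of TaskMate), NLTK's implementation of the Lesk algorithm could be used to probe word senses in a task description. The example sentence is the original task from Experiment 1; the NLTK 'punkt' and 'wordnet' data packages must be downloaded first.

    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    sentence = "List URLs where one can buy desktop computers for a classroom."
    tokens = word_tokenize(sentence)

    # Lesk picks the WordNet sense whose gloss best overlaps the context;
    # a low-overlap or missing sense can flag a potentially ambiguous word.
    sense = lesk(tokens, "classroom")
    if sense is not None:
        print(sense.name(), "-", sense.definition())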

Conclusion

In this paper, we present TaskMate, a mechanism to improve the quality of instructions in crowdsourcing. TaskMate consists of five stages: Identify, Resolve, Merge, Verify, and Select. In the Identify stage, each worker works on the task and identifies problems with the task instructions, along with a set of possible solutions. In the Resolve stage, the ambiguities identified in the previous stage are resolved by choosing the best solution for each. In the Merge stage, workers write a new task by merging the list of ambiguities and clarifications into the original instructions. In the Verify stage, workers check whether the merged task instruction has addressed all the ambiguities. In the final stage, Select, workers vote for the best instruction, the one with no remaining ambiguity. Our results show that workers were able to come up with instructions that are clear and precise, and that they were able to infer what the requester was trying to ask.

Acknowledgements

We would like to thank Professor Alex J. Quinn for his constant support and guidance throughout this project.