Published on

May 26, 2021

Task-oriented dialogue systems could be better. Here’s a new dataset to help.

Derek Chen

Dialogue State Tracking has run its course. Here’s why Action State Tracking and Cascading Dialogue Success are next.

For call center applications, dialogue state tracking (DST) has traditionally served as a way to determine what the user wants at that point in the dialogue. However, in actual industry use cases, the work of a call center agent is more complex than simply recognizing user intents.

In real-world environments, agents are typically tasked with strenuous multitasking. Tasks often include reviewing knowledge base articles, evaluating guidelines on what can be said, examining dialogue history with a customer, and inspecting customer account details all at once. In fact, according to ASAPP internal research, call center phone agents spend approximately 82 percent of their total time looking at customer data, step-by-step guides, or knowledge base articles. Yet none of these aspects are accounted for in classical DST benchmarks. A more realistic environment would employ a dual constraint, where the agent needs to obey customer requests while considering company policies when taking actions.

That’s why, to improve the state of the art of task-oriented dialogue systems for customer service applications, we’re establishing a new Action-Based Conversations Dataset (ABCD). ABCD is a fully labeled dataset with over 10k human-to-human dialogues containing 55 distinct user intents, each requiring a unique sequence of actions constrained by company policies to achieve task success.

The major difference between ABCD and other datasets is that it asks the agent to adhere to a set of policies that call center agents often face, while simultaneously dealing with customer requests. With this dataset, we propose two new tasks. Action State Tracking (AST) keeps track of the state of the dialogue at turns where we know that an action has taken place. Cascading Dialogue Success (CDS) measures the model’s ability to understand actions in context as a whole, which includes the context from other utterances.

Dataset Characteristics

Unlike other large open-domain dialogue datasets, which are often built for more general chatbot entertainment purposes, ABCD focuses on increasing the count and diversity of actions and text within the domain of customer service. Dataset participants also received financial bonuses for properly adhering to policy guidelines when handling customer requests, which mimics customer service environments and encourages realistic agent behavior.

The training process to annotate the dataset, for example, at times felt like training for a real call center role. “I feel like I’m back at my previous job as a customer care agent in a call center,” said one MTurk agent who was involved in the study. “Now I feel ready to work at or interview for a real customer service role,” said another.

New Benchmarks

The novel features in ABCD challenge the industry to measure performance across two new dialogue tasks: Action State Tracking and Cascading Dialogue Success.

Action State Tracking (AST)

AST improves upon DST metrics by detecting the pertinent intent from customer utterances while also taking into account constraints from agent guidelines. Suppose a customer is entitled to a discount which will be offered by issuing a [Promo Code]. The customer might request 30% off, but the guidelines stipulate only 15% is permitted, which would make “30” a reasonable, but ultimately flawed slot-value. To measure a model’s ability to comprehend such nuanced situations, we adopt overall accuracy as the evaluation metric for AST.
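To make the promo code example concrete, here is a minimal sketch of scoring AST predictions with overall accuracy. The `Action` type, the action names, and the rule that a prediction counts only when both the action name and its values match the policy-compliant target are illustrative assumptions, not the dataset's actual schema or the official evaluation script.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: an action name plus its slot values."""
    name: str
    values: Tuple[str, ...]


def ast_accuracy(predictions: Sequence[Action], targets: Sequence[Action]) -> float:
    """Fraction of action turns where the prediction exactly matches the target.

    Assumption: a turn is correct only if both the action name and all
    values match the policy-compliant target.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, targets))
    return correct / len(targets)


# Targets reflect what the guidelines allow (15% off), not what the customer asked for.
targets = [
    Action("promo-code", ("15",)),
    Action("verify-identity", ("jane doe",)),
]
# The model copies the customer's requested "30", which is plausible but violates policy.
predictions = [
    Action("promo-code", ("30",)),
    Action("verify-identity", ("jane doe",)),
]

print(ast_accuracy(predictions, targets))  # 0.5
```

In this sketch, the prediction that echoes the customer's request is marked wrong even though "30" appears in the dialogue, which is the behavior the guideline constraint is meant to capture.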

Cascading Dialogue Success (CDS)

Since the appropriate action often depends on the situation, we propose the CDS task to measure a model’s ability to understand actions in context. Whereas AST assumes an action occurs in the current turn, the task of CDS includes first predicting the type of turn and its subsequent details. The types of turns are utterances, actions, and endings. When the turn is an utterance, the detail is to respond with the best sentence chosen from a list of possible sentences. When the turn is an action, the detail is to choose the appropriate slots and values. Finally, when the turn is an ending, the model should know to end the conversation. This score is calculated on every turn, and the model is evaluated based on the percent of remaining steps correctly predicted, averaged across all available turns.
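The scoring described above can be sketched as follows. This is one reading of the description, not the official metric code: it assumes that from each starting turn the model is credited for consecutive correct steps until its first mistake, and that a turn is correct only when both the turn type and its detail match. The function names and the `(turn_type, detail)` tuple format are hypothetical.

```python
from typing import Sequence, Tuple

Turn = Tuple[str, str]  # hypothetical format: (turn_type, detail)


def turn_is_correct(pred: Turn, gold: Turn) -> bool:
    """A turn counts only if both the turn type and its detail match."""
    return pred == gold


def cascading_score(correct: Sequence[bool]) -> float:
    """Average, over every starting turn, of the share of remaining steps
    predicted correctly before the first error (assumed interpretation)."""
    n = len(correct)
    if n == 0:
        return 0.0
    total = 0.0
    for start in range(n):
        remaining = n - start
        streak = 0
        for ok in correct[start:]:
            if not ok:
                break  # credit stops at the first mistake
            streak += 1
        total += streak / remaining
    return total / n


gold = [
    ("utterance", "How can I help you today?"),
    ("action", "verify-identity: jane doe"),
    ("action", "promo-code: 15"),
    ("ending", ""),
]
pred = [
    ("utterance", "How can I help you today?"),
    ("action", "verify-identity: jane doe"),
    ("action", "promo-code: 30"),  # wrong value, so this turn fails
    ("ending", ""),
]

flags = [turn_is_correct(p, g) for p, g in zip(pred, gold)]
score = cascading_score(flags)
print(round(score, 3))  # 0.458
```

Under this reading, a single error in the middle of a conversation lowers the score for every earlier starting turn as well, which is why the metric rewards getting whole sequences right rather than isolated turns.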

Why This Matters

For customer service and call center applications, it is time for both the research community and industry to do better. Models that rely on DST as a measure of success give little indication of performance in real-world scenarios, and discerning CX leaders should look to other indicators grounded in the conditions that actual call center agents face.

Rather than relying on general datasets that expand upon an array of knowledge base lookup actions, ABCD presents a corpus for building more in-depth task-oriented dialogue systems. The availability of this dataset and two new tasks creates new opportunities for researchers to explore better, more reliable models for task-oriented dialogue systems.

We can’t wait to see what the community creates from this dataset. Our contribution to the field with this dataset is another major step toward improving machine learning models in customer service.

Read the Complete Paper & Access the Dataset

This work has been accepted at NAACL 2021. Meet the authors on June 8th, 20:00-20:50 EST, where this work will be presented as a part of “Session 9A-Oral: Dialogue and Interactive Systems.”

About the author

Derek Chen

Derek Chen is a Research Scientist at ASAPP designing intelligent dialogue systems with stronger natural language understanding capabilities. He received his Master’s in Computer Science from the University of Washington and his undergraduate degree from UC Berkeley. His research is focused on data efficiency methods including active learning, data augmentation, and meta-learning. He is also interested in techniques surrounding uncertainty measurement so that a dialogue agent can better manage ambiguity and out-of-scope situations.
