Derek Chen

Derek Chen is a Research Scientist at ASAPP designing intelligent dialogue systems with stronger natural language understanding capabilities. He received his Masters in Computer Science from the University of Washington and his undergraduate degree from UC Berkeley. His research is focused on data efficiency methods including active learning, data augmentation and meta-learning. He is also interested in techniques surrounding uncertainty measurement so that a dialogue agent can better manage ambiguity and out-of-scope situations.

GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation

Derek Chen

Imagine you’re booking airline tickets through a conversational AI assistant, and after purchasing tickets, you ask for help finding an in-home pet sitter during your trip. The conversational AI misinterprets what you mean and instead shares details on how to board your flight with pets. The reason is simple: the AI has never encountered this particular task and was unable to map it to a procedure. Your request to find an in-home pet sitter was outside the distribution of what the assistant was trained to handle. Alternatively, suppose you had asked about upgrading your flight, but the system mistook your request for one to update your flight to a different date. In this case, the AI assistant is capable of managing flights but was unable to complete the request due to a dialogue breakdown. In both cases, we arrive at the same result: a failed conversation.

Both the out-of-distribution requests and the dialogue breakdowns described above are considered out-of-scope (OOS) situations, since they represent cases that your assistant is unable to handle. To avoid customer frustration, detecting OOS scenarios has become an essential skill for today’s conversational AI and dialogue systems. While the ideal conversational AI agent would be able to help find an in-home pet sitter as requested and manage all the complex nuances of natural language, this is simply not possible given that training data is finite and consumer queries are not. Knowing when the user is asking something in-scope versus out-of-scope can therefore help conversational AI systems perform better on their core tasks.

It can be hard to provide training data for, or even enumerate, the potentially limitless number of out-of-scope queries a dialogue system may face. However, new ASAPP research presented at the conference on Empirical Methods in Natural Language Processing (EMNLP) offers a novel way to address this limited-data problem.

Out-of-Scope Detection with Data Augmentation

We introduce GOLD (Generating Out-of-scope Labels with Data augmentation), a new technique that augments existing data to train better out-of-scope detectors operating in low-data regimes. The key insight is that rather than training on in-scope data alone, our proposed method operates on out-of-scope data as well. Furthermore, we discover that common NLP techniques for augmenting in-scope data, such as paraphrasing, do not provide the same benefit when working with out-of-scope data.

GOLD works by starting with a small seed set of known out-of-scope examples. This small amount (only 1% of the training data) is typically used by prior methods for tuning thresholds and other hyperparameters. Instead, GOLD uses this seed set of OOS examples to find semantically similar utterances from an auxiliary dataset, which yields a large set of matches. Next, we create candidate examples by replacing utterances in the known out-of-scope dialogues with the sentences found in extracted matches. Lastly, we filter down candidates to only those which are most likely to be out-of-scope. These pseudo-labeled examples created through data augmentation are then used to train the OOS detector.
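The three steps above (match, swap, filter) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from the paper: token-overlap (Jaccard) similarity stands in for semantic similarity, the swap replaces only the final utterance of a seed dialogue, and the filter is a unanimous vote among hypothetical keyword-based detectors. All function names, utterances, and keywords are made up for the example.

```python
from typing import Callable, List

Dialogue = List[str]
Detector = Callable[[Dialogue], bool]  # returns True when it judges the dialogue OOS


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: token overlap between two utterances."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_matches(seed_utterance: str, auxiliary: List[str], k: int) -> List[str]:
    """Step 1: retrieve the k auxiliary utterances most similar to a seed OOS utterance."""
    ranked = sorted(auxiliary, key=lambda u: jaccard(seed_utterance, u), reverse=True)
    return ranked[:k]


def make_candidates(dialogue: Dialogue, matches: List[str]) -> List[Dialogue]:
    """Step 2: swap the final (OOS) utterance of the seed dialogue with each match."""
    return [dialogue[:-1] + [m] for m in matches]


def filter_candidates(
    candidates: List[Dialogue], detectors: List[Detector], min_votes: int
) -> List[Dialogue]:
    """Step 3: keep candidates that at least min_votes detectors flag as OOS."""
    return [c for c in candidates if sum(d(c) for d in detectors) >= min_votes]


def keyword_detector(keyword: str) -> Detector:
    """Hypothetical detector: votes OOS when an in-scope keyword is absent."""
    return lambda dialogue: keyword not in dialogue[-1].lower()


# Assumed toy data: one seed OOS dialogue and a small auxiliary pool.
seed_dialogue = ["I booked a flight to Denver.", "Can you find a pet sitter for my dog?"]
auxiliary = [
    "Can you find a dog walker for my puppy?",
    "I want to change my flight date.",
    "Can you recommend a sitter for my cat?",
    "What is my baggage allowance?",
]
detectors = [keyword_detector(k) for k in ("flight", "baggage", "ticket")]

matches = find_matches(seed_dialogue[-1], auxiliary, k=3)
candidates = make_candidates(seed_dialogue, matches)
augmented = filter_candidates(candidates, detectors, min_votes=len(detectors))

for dialogue in augmented:
    print(dialogue[-1])
```

In this toy run, the baggage question is retrieved as a match but dropped by the filter because one detector considers it in-scope. The two surviving dialogues would serve as pseudo-labeled OOS training examples.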

The results? State-of-the-art performance across three task-oriented dialogue datasets on multiple metrics. These datasets were created by post-processing existing dialogue corpora spanning multiple domains with multi-turn interactions. Notably, the out-of-scope instances were designed as a natural progression of the conversation, rather than generated through synthetic noise or negative sampling.

Why this matters

Data augmentation is a popular method to improve model performance in low-resource settings, especially in real-world settings where annotating more examples can quickly become cost-prohibitive. With just a small seed of out-of-scope examples, GOLD achieved a 10X improvement in training out-of-scope detectors compared to using the seed data alone. Previous methods relied either on tremendous amounts of labeled out-of-scope data, which is unrealistic to obtain in real-world settings, or on in-scope data alone, which doesn’t provide sufficient signal for detecting OOS items.



GOLD supports robustness and prevents overfitting by relying on other methods during the filtering process. As other out-of-scope detection methods improve over time, GOLD can take advantage of those gains and improve as well.

At ASAPP, we are exploring similar methods in our products both to reduce out-of-scope issues in our conversational systems and to improve overall systems when operating in limited-data regimes. If you’re a researcher working on detecting more granular levels of errors, or on more sophisticated methods of data efficiency, we’d love to chat! Give us a tweet at @ASAPP.

Read our paper on GOLD

Task-oriented dialogue systems could be better. Here’s a new dataset to help.

Dialogue State Tracking has run its course. Here’s why Action State Tracking and Cascading Dialogue Success is next.

For call center applications, dialogue state tracking (DST) has traditionally served as a way to determine what the user wants at that point in the dialogue. However, in actual industry use cases, the work of a call center agent is more complex than simply recognizing user intents.

In real-world environments, agents are typically tasked with strenuous multitasking. Tasks often include reviewing knowledge base articles, evaluating guidelines on what can be said, examining dialogue history with a customer, and inspecting customer account details, all at once. In fact, according to ASAPP internal research, call center phone agents spend approximately 82 percent of their total time looking at customer data, step-by-step guides, or knowledge base articles. Yet none of these aspects are accounted for in classical DST benchmarks. A more realistic environment would employ a dual constraint where the agent needs to obey customer requests while considering company policies when taking actions.

That’s why, in order to improve the state of the art of task-oriented dialogue systems for customer service applications, we’re establishing a new Action-Based Conversations Dataset (ABCD). ABCD is a fully-labeled dataset with over 10k human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by company policies to achieve task success.

The major difference between ABCD and other datasets is that it asks the agent to adhere to a set of policies that call center agents often face, while simultaneously dealing with customer requests. With this dataset, we propose two new tasks: Action State Tracking (AST), which keeps track of the state of the dialogue when we know that an action has taken place during that turn, and Cascading Dialogue Success (CDS), a measure of the model’s ability to understand actions in context as a whole, which includes the context from other utterances.


Dataset Characteristics

Unlike other large open-domain dialogue datasets, which are often built for more general chatbot entertainment purposes, ABCD focuses more deeply on increasing the count and diversity of actions and text within the domain of customer service. Dataset participants were additionally incentivized through financial bonuses when properly adhering to policy guidelines in handling customer requests, mimicking customer service environments and realistic agent behavior.

The training process to annotate the dataset, for example, at times felt like training for a real call center role. “I feel like I’m back at my previous job as a customer care agent in a call center,” said one MTurk agent who was involved in the study. “Now I feel ready to work at or interview for a real customer service role,” said another.

New Benchmarks

The novel features in ABCD challenge the industry to measure performance across two new dialogue tasks: Action State Tracking and Cascading Dialogue Success.

Action State Tracking (AST)

AST improves upon DST metrics by detecting the pertinent intent from customer utterances while also taking into account constraints from agent guidelines. Suppose a customer is entitled to a discount which will be offered by issuing a [Promo Code]. The customer might request 30% off, but the guidelines stipulate only 15% is permitted, which would make “30” a reasonable, but ultimately flawed slot-value. To measure a model’s ability to comprehend such nuanced situations, we adopt overall accuracy as the evaluation metric for AST.
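To make the promo code example concrete, here is a minimal sketch of an accuracy computation for AST. It assumes a turn counts as correct only when both the predicted action and its slot values exactly match the labels; the action names and values are hypothetical and are not taken from the dataset.

```python
from typing import List, Tuple

# An action is represented as (action name, slot values). This layout is an assumption.
Action = Tuple[str, Tuple[str, ...]]


def ast_accuracy(predictions: List[Action], targets: List[Action]) -> float:
    """Fraction of action turns where the action and all of its values match the label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# The customer asked for 30% off, but the guidelines only permit 15%.
targets = [("promo-code", ("15",)), ("verify-identity", ("jane doe",))]
predictions = [("promo-code", ("30",)), ("verify-identity", ("jane doe",))]

score = ast_accuracy(predictions, targets)
print(score)
```

Here the model picks the right action for the discount but copies the value from the customer utterance instead of the guidelines, so only one of the two turns counts as correct.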

Cascading Dialogue Success (CDS)

Since the appropriate action often depends on the situation, we propose the CDS task to measure a model’s ability to understand actions in context. Whereas AST assumes an action occurs in the current turn, the task of CDS includes first predicting the type of turn and its subsequent details. The types of turns are utterances, actions, and endings. When the turn is an utterance, the detail is to respond with the best sentence chosen from a list of possible sentences. When the turn is an action, the detail is to choose the appropriate slots and values. Finally, when the turn is an ending, the model should know to end the conversation. This score is calculated on every turn, and the model is evaluated based on the percent of remaining steps correctly predicted, averaged across all available turns.
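The scoring described above can be sketched as follows. This is one reading of the description and should be treated as an assumption: starting from each turn, count how many consecutive turns are predicted correctly before the first mistake, divide by the number of turns remaining, and average that fraction over all starting turns. The turn types and details in the example are hypothetical.

```python
from typing import List, Tuple

# A turn is represented as (turn type, detail). This layout is an assumption.
Turn = Tuple[str, str]


def cascading_success(predictions: List[Turn], targets: List[Turn]) -> float:
    """Average, over every starting turn, of the share of remaining turns
    predicted correctly before the first error."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(targets)
    if n == 0:
        return 0.0
    correct = [p == t for p, t in zip(predictions, targets)]
    total = 0.0
    for start in range(n):
        run = 0
        for ok in correct[start:]:
            if not ok:
                break  # the first mistake ends the chain from this starting turn
            run += 1
        total += run / (n - start)
    return total / n


targets = [
    ("utterance", "How can I help you today?"),
    ("action", "verify-identity:jane doe"),
    ("action", "promo-code:15"),
    ("ending", ""),
]
predictions = [
    ("utterance", "How can I help you today?"),
    ("action", "verify-identity:jane doe"),
    ("action", "promo-code:30"),  # wrong value, so this turn is incorrect
    ("ending", ""),
]

cds_score = cascading_success(predictions, targets)
print(round(cds_score, 4))
```

Under this reading, a single mistake mid-dialogue lowers the score for every earlier starting turn, which is what makes the measure stricter than per-turn accuracy.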

Why This Matters

For customer service and call center applications, it is time for both the research community and industry to do better. Models relying on DST as a measure of success have little indication of performance in real world scenarios, and discerning CX leaders should look to other indicators grounded in the conditions that actual call center agents face.

Rather than relying on general datasets which expand upon an obtuse array of knowledge base lookup actions, ABCD presents a corpus for building more in-depth task-oriented dialogue systems. The availability of this dataset and two new tasks creates new opportunities for researchers to explore better, more reliable models for task-oriented dialogue systems.

We can’t wait to see what the community creates from this dataset. Our contribution to the field with this dataset is another major step toward improving machine learning models in customer service.

Read the Complete Paper, & Access the Dataset

This work has been accepted at NAACL 2021. Meet the authors on June 8th, 20:00 to 20:50 EST, where this work will be presented as a part of “Session 9A-Oral: Dialogue and Interactive Systems.”
