Designing AutoCompose

Min Kim
Jun 3
2 mins

ASAPP researchers have spent the past 8+ years pushing the limits of machine learning to provide contact center agents with content suggestions that are astonishingly timely, relevant, and helpful. While these technological advancements are ground-breaking, they’re only part of the success of AutoCompose. Knowing what the agent should say or do next is only half the battle. Getting the agent to actually notice, trust, and engage with the content is an entirely different challenge.

While it may not be immediately noticeable to you as a customer, contact center agents often navigate a head-exploding mash-up of noisy applications and confusing CRMs when working towards a resolution. Beyond that, they juggle well-meaning protocols and policies that are intended to ensure quality and standardize workflows, but instead quickly become overwhelming and ineffective—if not counter-productive.

The ASAPP Design team took notice of this ever-growing competition for attention and sought to turn down the noise when designing AutoCompose.

Instead of getting bigger and louder in our guidance, we focused on a flexible and intuitive UI that gives agents the right amount of support at exactly the right time—all without being disruptive to their natural workflow.

Min Kim

We had several user experience imperatives when iterating on the AutoCompose UI design.

Be convenient, but not disruptive

Knowing where to showcase suggestions isn’t obvious. We experimented with several underperforming placements until landing on the effective solution: wedging the UI directly in the line of sight between the composer input and the chat log. The design was minimalist, keeping visual noise to a minimum and focusing on contrast and legibility.

Auto-suggest and phrase auto-complete

The value of AutoCompose stems from recognition rather than recall, which takes advantage of the human brain’s ability to readily digest recent, contextual information. Instead of requiring agents to memorize an endless number of templates and commands, AutoCompose places suggestions in multiple locations where an agent can recognize and choose them. When the agent is in the middle of drafting a sentence, Phrase Auto-Complete offers the suggested full sentence inline within the text input. As the agent types words and phrases, AutoSuggest shows the most relevant suggestions for the current context, located between the chat log and the composer, so that the agent can stay informed about the chat context. By placing suggestions where agents need them, AutoCompose lets agents recognize and use them immediately.

Just the right amount

In UI design, there is often a fine line between too much and too little. We experienced this when evaluating the threshold for how many suggestions to display. AutoSuggest currently displays up to three suggestions that update in real time as an agent types. We’ve been intentional about capping suggestions at a maximum of three, and we do our best to make them relevant. The model only shows confident, quality-ensured suggestions above a set threshold. With this, the UI shows the right number of suggestions for the cognitive load that agents can handle at one time.
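To make the capping rule concrete, here is a minimal Python sketch of how a list of scored candidates could be reduced to at most three confident suggestions. It is an illustration only, not the AutoCompose implementation: the `Suggestion` type, the `select_suggestions` function, and the threshold value of 0.6 are all assumptions, since the post does not say how confidence is scored or where the threshold sits.

```python
from dataclasses import dataclass

MAX_SUGGESTIONS = 3          # cap stated in the post
CONFIDENCE_THRESHOLD = 0.6   # hypothetical value; the real threshold is not given


@dataclass(frozen=True)
class Suggestion:
    text: str
    confidence: float  # assumed model score in [0, 1]


def select_suggestions(candidates, threshold=CONFIDENCE_THRESHOLD, limit=MAX_SUGGESTIONS):
    """Keep candidates scoring above the threshold, best first, at most `limit`."""
    confident = [c for c in candidates if c.confidence > threshold]
    confident.sort(key=lambda c: c.confidence, reverse=True)
    return confident[:limit]
```

Note that the function can return fewer than three suggestions, or none at all, when the model is not confident. Showing nothing is a valid outcome under this rule.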


Speed matters

Another critical component to the design is latency. To fit within an agent’s natural workflow, the suggestions must update within a fraction of a second—or risk the agent ignoring the suggestions altogether.

Specifically, a latency of less than 100ms ensures the agent feels a sense of direct manipulation associated with every keystroke. Beyond that, the updating of suggestions can fall behind the pace of conversation, making the experience painfully disjointed.

Support the long tail of use cases

In contact centers, when agents encounter complex issues, they may choose to resolve them differently depending on their tenure and experience. In these scenarios, we may not have the right answer, so we instantly shift our UX priorities to make it easy for the agents to find what they’re looking for.

Custom Response Drawer Search
Global Response Search Prompt

We focused on integrating search, browsing, and shortcuts into a compact but highly dynamic UI. Experienced agents may need to pull effective responses that they built on their own. Meanwhile, novice agents need more handholding to pull from company-provided response suggestions, also known as global responses. To accommodate both, we experimented with ways to introduce shortcuts, such as a drawer search inline within the text field and a global response search that is prompted on top of AutoSuggest. AutoCompose now accommodates these long-tail use cases with our dynamic, contextual UI approach.

What might seem like a simple UI is actually packed with details and nuanced interactions to maximize agent productivity. With subtle and intentional design decisions, we give the right amount of support to agents at the right time.


Not all automation is the same

Heather Reed
May 25
2 mins

At ASAPP we develop AI models to improve agent performance. Many of these models directly assist agents by automating parts of their workflow. For example, the automated responses generated by AutoCompose suggest to an agent what to say at a given point during a customer conversation. Agents often use our suggestions by clicking and sending them.

How we measure performance matters

While usage of the suggestions is a great indicator of whether the agents like the features, we’re even more interested in the impact the automation has on performance metrics like agent handle time, concurrency, and throughput. These metrics are ultimately how we measure agent performance when evaluating the impact of a product like AutoCompose, but these metrics can be affected by things beyond AutoCompose usage, like changes in customer intents or poorly-planned workforce management.

To isolate the impact of AutoCompose usage on agent efficiency, we prefer to measure the specific performance gains from each individual usage of AutoCompose. We do this by measuring the impact of automated responses on agent response time, because response time is more invariant to intent shifts and organizational effects than handle time, concurrency and throughput.

By doing this, we can further analyze:

  • The types of agent utterances that are most often automated
  • The impact of the automated responses when different types of messages are used (in terms of time savings)

Altogether, this enables us to be data-driven about how we improve models and develop new features to have maximum impact.

Going beyond greeting and closing messages

When we train AI models to automate responses for agents, the models look for patterns in the data that can predict what to say next based on past conversation language. So the easiest things for models to learn well are the types of messages that occur often and without much variation across different types of conversations, e.g. greetings and closings. Agents typically greet and end a conversation with a customer the same way, perhaps with some specificity based on the customer’s intent.

Most AI-driven automated response products will correctly suggest greeting and closing messages at the correct time in the conversation. This typically accounts for the first 10-20% of automated response usage rates. But when we evaluate the impact of automating those types of messages, we see that it’s minimal.

To understand this, let’s look at how we measure impact. We compare agents’ response times when using automated responses against their response times when not using automated responses. The difference in time is the impact—it’s the time savings we can credit to the automation.
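The comparison described above can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the record layout (`seconds`, `automated`) and the function name are hypothetical, and a plain difference of means stands in for whatever statistical treatment is actually used.

```python
from statistics import mean


def response_time_savings(responses):
    """Return mean manual response time minus mean automated response time (seconds).

    `responses` is a list of dicts with assumed keys:
      "seconds"   - how long the agent took to send the message
      "automated" - True if the agent used a suggested response
    """
    automated = [r["seconds"] for r in responses if r["automated"]]
    manual = [r["seconds"] for r in responses if not r["automated"]]
    if not automated or not manual:
        raise ValueError("need at least one automated and one manual response")
    return mean(manual) - mean(automated)
```

Running the same calculation separately for each message type (greetings, closings, mid-conversation messages) would show where the savings actually come from.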

Without automation, agents are not manually typing greeting and closing messages for every conversation. Rather, they’re copying and pasting from Notepad or Word documents containing their favorite messages. Agents are effective at this because they do it several times per conversation. They know exactly where their favorite messages are located, and they can quickly copy and paste them into their chat window. Each greeting or closing message might take an agent 2 seconds. When we automate those types of messages, all we are actually automating is the 2-second copy/paste. So when we see automation rates of 10-20%, we are likely only seeing a minimal impact on agent performance.

The impact lies in automating the middle of the conversation.

If automating the beginnings and endings of conversations is not that impactful, what is?

Automating the middle of the conversation is where response times are naturally slowest and where automation can yield the most agent performance impact.

Heather Reed, PhD

The agent may not know exactly what to say next, requiring time to think or look up the right answers. It’s unlikely that the agent has a script readily available for copying or pasting. If they do, they are not nearly as efficient as they are with their frequently used greetings and closings.

ASAPP - Usage rate

Where it was easy for AI models to learn the beginnings and endings of conversations, because they most often occur the same way, the exact opposite is true of the middle parts of conversations. Often, this is where the most diversity in dialog occurs. Agents handle a variety of customer problems, and they solve them in a variety of ways. This results in extremely varied language throughout the middle parts of conversations, making it hard for AI models to predict what to say at the right time.

ASAPP’s research delivers the biggest improvements

Whole interaction models are exactly what the research team at ASAPP specializes in developing. And it’s the reason that AutoCompose is so effective. If we look at AutoCompose usage rates throughout a conversation, we see that while there is a higher usage at the beginnings and endings of conversations, AutoCompose still automates over half of agent responses in between.

The slow response times in the middle of conversations are where we see the biggest improvements in agent response time, and where the biggest opportunities for improvement are realized.

Whole interaction automation shows widespread improvements

ASAPP’s current automated response rate is about 75%. It has taken a lot of model improvements, new features, and user-tested designs to get there. But now agents effectively use our automated responses to reduce handle times by 16%, enabling an additional 15% of agent concurrency, for a combined improvement in throughput of 35%. The AI model continues to get better with use, improving suggestions, and becoming more useful to agents and customers.


How to start assessing and improving the way your agents use their tools

Adrian Botta
May 19
2 mins

Customer care leaders tasked with improving customer satisfaction while also reducing cost often find it challenging to know where to begin. It’s hard to know what types of problems are ideal for self-serve, where the biggest bottlenecks exist, what workflows could be streamlined, and how to provide both the right training and targeted feedback to a large population of agents.

We talk with stakeholders from many companies who are charged with revamping the tools agents have at their disposal to improve agent effectiveness and decrease negative outcomes like call-backs and handle time.

Some of the first questions they ask are:

  • Where are my biggest problem areas?
  • What can I automate for self-service?
  • What needs to be streamlined?
  • Where are there gaps in my agents’ resources?
  • Are they using the systems and the tools we’re investing in for the problems they were designed to help solve?
  • What are my best agents doing?
  • Which agents need to be trained to use their tools more effectively? And on which tools?

These questions require an understanding of both the tools that agents use and the entire landscape of customer problems. This is the first blog in a series that details how an ASAPP AI Service, JourneyInsight, ties together what agents are saying and doing with key outcomes to answer these questions.

Using JourneyInsight, customer care leaders can make data-driven, impactful improvements to agent processes and agent training.

Adrian Botta

Most customers start with our diagnostic reports to identify problem areas and help them prioritize improvements. They then use our more detailed reports that tie agent behavior with outcomes and compare performance across agent groups to drive impactful changes.

These diagnostic reports provide visibility into, and context behind, KPIs that leaders have not had before.

Our Time-on-Tools report captures how much time agents spend on each of their tools for each problem type or intent. This enables a user to:

  • Compare tool usage for different intents,
  • Understand the distribution of time spent on each tool,
  • Compare times across different agent groups.

With this report, it’s easy to see the problem types where agents are still relying on legacy tools, or how the 20% least-tenured agents spend 30% more time on the payments page than their colleagues do for a billing-related question.
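As a rough illustration of the aggregation behind a report like this, the Python sketch below sums time per intent and tool and computes each tool's share of the time spent on that intent. The event format and the `time_on_tools` name are assumptions made for the example; they do not describe JourneyInsight's actual data model.

```python
from collections import defaultdict


def time_on_tools(events):
    """Sum seconds per (intent, tool) and compute each tool's share of intent time.

    `events` is a list of dicts with assumed keys "intent", "tool", "seconds".
    """
    totals = defaultdict(float)         # (intent, tool) -> seconds
    intent_totals = defaultdict(float)  # intent -> seconds across all tools
    for e in events:
        totals[(e["intent"], e["tool"])] += e["seconds"]
        intent_totals[e["intent"]] += e["seconds"]

    report = {}
    for (intent, tool), secs in totals.items():
        total = intent_totals[intent]
        share = secs / total if total else 0.0
        report[(intent, tool)] = {"seconds": secs, "share": share}
    return report
```

Adding an agent-group key to each event and to the grouping tuple would support the cross-group comparison mentioned above.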

Our Agent Effort report captures the intensity of the average issue for each problem type or intent. This enables a user to:

  • Compare how many systems are used,
  • See how frequently agents switch between different tools,
  • Understand how they have to interact with those tools to solve the customer’s problem.

With this report, it’s easy to identify which problem types require the most effort to resolve, how the best agents interact with their tools, and how each agent stacks up against the best agents.
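Two of the effort signals listed above, how many systems are used and how often agents switch between them, can be derived from an ordered log of the tool in focus. The Python sketch below assumes such a log exists as a simple list of tool names per issue; both that format and the `agent_effort` function are hypothetical.

```python
def agent_effort(tool_sequence):
    """Return (distinct tools used, switches between consecutive tools).

    `tool_sequence` is the assumed ordered list of tools the agent had in
    focus while working one issue, e.g. ["crm", "kb", "crm"].
    """
    distinct = len(set(tool_sequence))
    switches = sum(
        1 for prev, cur in zip(tool_sequence, tool_sequence[1:]) if prev != cur
    )
    return distinct, switches
```

Averaging these two numbers per intent, and comparing each agent against the top performers, gives a simple version of the comparison the report describes.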

These examples illustrate some of the ways our customers have used these reports to answer key questions.

What can I automate for self-service?

When looking for intents to address with self-serve capabilities, it is critical to know how much volume could be reduced and the cost of implementation. The cost can be informed by how complex the problem-solving workflows are and which systems will need to be integrated with.

Our diagnostic reports for one customer for a particular intent showed that:

  • Agents use one tool for 86% of the issue and agents switch tabs infrequently during the issue
  • Agents copy and paste information from that primary tool to the chat on average 3.2 times over the course of the issue
  • The customer usually has all of the information that an agent needs to fill out the relevant forms
  • Agents consult the knowledge base in only 2.8% of the issues within that intent
  • There is very little variation in the way the agents solve that problem

All of these data points indicate that this intent is simple, consistent, and requires relaying information from one existing system to a customer. This makes it a good candidate to build into a self-serve virtual agent flow.

Our more detailed process discovery reports can identify multiple workflows per intent and outline each workflow. They also provide additional details and statistics needed to determine whether the workflow is ideal for automation.

Where are there gaps in my agents’ resources?

Correct resource usage is generally determined by the context of the problem and the type or level of agent using the resource.

Our diagnostic reports for one customer for a particular intent showed that:

  • A knowledge base article that was written to address the majority of problems for a certain intent is being accessed during only 4% of the issues within that intent
  • Agents are spending 8% of their time during that intent reaching out to their colleagues through an internal web-based messaging tool (e.g. Google Chat)
  • Agents access an external search tool (e.g. Google search) 19% of the time to address this intent.
  • There are very inconsistent outcomes, such as AHT ranging from 2 minutes to 38 minutes and varying callback rates by agent group

These data points suggest that this intent is harder to solve than expected and that resources need to be updated to provide agents with the answers they need. It is also possible that agents were not aware of the resources that have been designed to help solve that problem.

We followed up to take a deeper look with our more detailed outcome drivers report. It showed that when agents use the knowledge base article written for that intent, callback rates are lower. This indicates that the article likely does, in fact, help agents resolve the issue.

In subsequent posts, we’ll describe how we drive more value using predictive models and sequence mining to help identify root causes of negative outcomes and where newer agents are deviating from more tenured agents.


To measure the performance of Conversational AI, we need more strict, better quality benchmarks

Suwon Shon
May 13
2 mins

Introducing the Spoken Language Understanding Evaluation (SLUE) benchmark suite

Progress on speech processing has benefited from shared datasets and benchmarks. Historically, these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. However, “higher-level” spoken language understanding (SLU) tasks have received less attention and resources in the speech community. There are numerous tasks at varying linguistic levels that have been benchmarked extensively for text input by the natural language processing (NLP) community – named entity recognition, parsing, sentiment analysis, entailment, summarization, and so on – but they have not been as thoroughly addressed for speech input.

Consequently, SLU sits at the intersection of the speech and NLP fields but has not been addressed seriously from either side. We think the biggest reason for this disconnect is the lack of an appropriate benchmark dataset. This lack makes performance comparisons very difficult and raises the barrier to entry into this field. A high-quality benchmark would allow both the speech and NLP communities to address open research questions about SLU, such as which tasks can be addressed well by pipeline ASR+NLP approaches and which applications benefit from end-to-end or joint modeling, and, for the latter kind of task, how best to extract the needed speech information.

For conversational AI to advance, the broader scientific community must be able to work together and explore with easily accessible state-of-the-art baselines for fair performance comparisons.

Suwon Shon, PhD

We believe that for conversational AI to advance, the broader scientific community must be able to work together and explore with easily accessible state-of-the-art baselines for fair performance comparisons. A present lack of benchmarks of this kind is our main motivation in establishing the SLUE benchmark and its suite.

The first phase of SLUE

We are launching the first benchmark, which considers ASR, NER, and sentiment analysis (SA), with a particular emphasis on low-resource SLU. For this benchmark, we contribute the following:

  1. New annotation of publicly available, natural speech data for training and evaluation on new tasks, specifically named entity recognition (NER) and sentiment analysis (SA), as well as new text transcriptions for training and evaluating ASR systems on the same data.
  2. A benchmark suite including a toolkit for reproducing state-of-the-art baseline models and evaluation, the annotated data, website, and leaderboard.
  3. A variety of baseline models that can be reproduced to measure the state of existing models on these new tasks.
  4. A small labeled dataset to encourage new algorithms and findings for low-resource SLU tasks.

SLUE covers 2 SLU tasks (NER and SA) + ASR tasks. All evaluation in this benchmark starts with speech as input, whether it is a pipeline approach (ASR+NLP model) or an end-to-end model that predicts results directly from speech.

The SLUE benchmark suite covers downloading the dataset, training state-of-the-art baselines, and evaluating against high-quality annotations. On the website, we provide an online leaderboard for following up-to-date performance. We strongly believe that the SLUE benchmark makes SLU tasks much more accessible, so that researchers can focus on problem-solving.

Current leaderboard of the SLUE benchmark

Why it matters

Recent SLU-related benchmarks have been proposed with similar motivations to SLUE. However, those benchmarks are not as comprehensive as SLUE, for the following reasons:

  1. Some of their tasks already see nearly perfect performance (SUPERB, ATIS), so they cannot discriminate between different approaches.
  2. Some benchmark datasets consist of artificial (synthesized) rather than natural speech (SLURP), which does not recreate real-world conditions.
  3. Some provide audio only for evaluation, with no training audio available (ASR-GLUE).
  4. Some benchmark datasets use short speech commands rather than longer conversational speech (SLURP, FSC).
  5. Some have license constraints limiting their industry use (Switchboard NXT, FSC).

SLUE provides a comprehensive comparison between models without those shortcomings. We expect the SLUE benchmark to:

  1. Track research progress on multiple SLU tasks,
  2. Facilitate the development of pre-trained representations by providing fine-tuning and eval sets for a variety of SLU tasks,
  3. Foster the open exchange of research by focusing on freely available datasets that all academic and industrial groups can easily use.

Motivated by the growing interest in SLU tasks and recent progress on pre-trained representations, we have proposed a new benchmark suite consisting of newly annotated fine-tuning and evaluation sets, and have provided annotations and baselines for new NER, sentiment, and ASR evaluations. For the initial study of the SLUE benchmark, we evaluated numerous baseline systems using current state-of-the-art speech and NLP models.

This work is open to all researchers in the multidisciplinary community. We welcome similar research efforts focused on low-resource SLU, so we can continue to expand this benchmark suite with more tests and data. To contribute or expand on our open-source dataset, please email or get in touch with us at

Additional Resources

  1. Attend our ICASSP 2022 session
     Presentation Time: Thu, 12 May, 08:00 – 08:45 New York Time (UTC -4)
  2. Attend our Interspeech 2022 special session “low-resource SLU”
     September 18-22, 2022, Incheon, South Korea
  3. Paper
  4. SLUE Benchmark Suite (Toolkit and dataset)
  5. Website and leaderboard
  6. Email me or get in touch with us @ASAPP.

A contact center case study about call summarization strategies

Gonzalo Chebi
Apr 14
2 mins

All agents at a contact center are typically required to write summary – or disposition – notes for each conversation. These notes are intended to be used for several purposes. They provide context if the issue needs to be revisited in follow-up calls. This avoids the need for the customer to repeat the problem and saves the agent time. Also, supervisors can use these notes to see how often certain situations arise and identify coaching opportunities. Good disposition notes will include the customer’s contact reason, key actions taken to solve it, and the conversation outcome.

The time required to take these notes is, on average, 10% of the actual call duration and agents may only capture some aspects of the conversation. This is why many contact center leaders are looking for ways to reduce the time spent writing these notes and increase their quality.

Findings from a large enterprise contact center

As in most contact centers, agents at this company were writing all notes at the end of each conversation. Aiming to increase agent utilization (i.e. the proportion of time agents are talking with a customer), the company shifted to having agents write the notes during, and not after, the conversation. Agents are encouraged to write these notes in “natural pauses” within the conversation. This way, agents significantly reduce the time between the end of one conversation and the start of the next.

In reviewing call data, we learned that in a large proportion of voice calls, these natural pauses do not occur very often. To understand this, for each conversation we first identify the time intervals in which the customer or the agent is talking, as shown in Figure 1 below. Based on these customer and agent turn intervals, we can identify the pauses in the conversation. For this analysis, we only keep pauses that last at least 10 seconds.

Figure 1: intervals where the customer and the agents are talking, along with the pauses of at least 10 seconds.
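The pause detection described above amounts to finding gaps between talk intervals. Here is a minimal Python sketch of that idea, assuming each turn is available as a `(start, end)` pair in seconds. The function names are hypothetical, and counting silence before the first turn or after the last turn as a pause is an assumption of this sketch, not something the analysis specifies.

```python
MIN_PAUSE_SECONDS = 10.0  # minimum pause length stated in the post


def find_pauses(turns, call_start, call_end, min_pause=MIN_PAUSE_SECONDS):
    """Return (start, end) gaps of at least `min_pause` seconds with nobody talking.

    `turns` holds (start, end) talk intervals for either speaker; overlapping
    intervals are handled by tracking the latest end time seen so far.
    """
    pauses = []
    cursor = call_start  # end of the latest speech seen so far
    for start, end in sorted(turns):
        if start - cursor >= min_pause:
            pauses.append((cursor, start))
        cursor = max(cursor, end)
    if call_end - cursor >= min_pause:
        pauses.append((cursor, call_end))
    return pauses


def pauses_per_minute(pauses, call_start, call_end):
    """Number of pauses divided by call length in minutes."""
    minutes = (call_end - call_start) / 60.0
    return len(pauses) / minutes if minutes > 0 else 0.0
```

Computing `pauses_per_minute` for every call yields the per-call values that a histogram of pause frequency would be built from.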

As we show in the histogram in Figure 2, we estimate that half of the calls have fewer than one pause every two minutes and 13% of the calls have no pauses at all. Moreover, during most of those pauses, agents are busy actively working on the issue (looking for information, filling in forms, etc.), so taking notes is not a possibility.

Figure 2: Histogram of the number of pauses per minute in voice calls. Here, a pause is defined as an interval of at least 10 seconds in which neither the agent nor the customer is talking.

This means that when an agent takes notes in the middle of the conversation they are usually creating an artificial pause. In other words, they are transferring the time it would have taken to take the notes at the end, to more time spent on the call with each customer. Moreover, when they don’t finish notes during the call, note-taking for that call spills over into the next call which significantly increases the complexity for the agent.

Having agents take notes during the conversation does not improve efficiency and may harm the overall customer experience.

Gonzalo Chebi, PhD

On the customer side, pauses in the middle of the conversation likely have negative consequences. Our data consistently shows that conversations with longer response times are associated with a lower Customer Satisfaction (CSAT) score, as we show in Figure 3 (the CSAT is on a scale from 1 to 5 here). In addition to waiting through pauses, the overall time the customer (and the agent) spend on the call is longer.

Figure 3: We calculated the longest pause in each conversation and bucketed this variable with a bucket size of 5 seconds. Each point represents one bucket: the x-axis corresponds to the lower end of the bucket range and the y-axis corresponds to the average CSAT score for all the conversations in that bucket.
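The bucketing described in the Figure 3 caption can be sketched as follows. This is a minimal Python illustration that assumes each conversation has already been reduced to a `(longest_pause_seconds, csat)` pair; the function name is hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 5  # bucket size stated in the Figure 3 caption


def csat_by_longest_pause(conversations, bucket_size=BUCKET_SECONDS):
    """Map each bucket's lower bound (seconds) to the mean CSAT in that bucket.

    `conversations` is an iterable of (longest_pause_seconds, csat) pairs.
    """
    scores = defaultdict(list)
    for longest_pause, csat in conversations:
        lower = int(longest_pause // bucket_size) * bucket_size
        scores[lower].append(csat)
    return {lower: sum(v) / len(v) for lower, v in sorted(scores.items())}
```

Each key in the result corresponds to one point's x-value in the figure, and each value to its y-value.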

The value of automating call summaries

We already showed that taking notes during the call does not improve agent efficiency and may harm the overall customer experience. On the other hand, automating conversation summaries can be a way to reduce or completely eliminate dispositioning time for the agents as well as increase the general quality of the summaries.

The customer in this case study is initially making the automated summaries visible in their agent desk, enabling the agents to review and edit.

This has significantly reduced the time agents devote to this task.

As confidence in the AutoSummary model grows, companies may opt to remove manual reviews completely from agents’ task lists and take the additional efficiency gains available. Other customers bypass this step and use AutoSummary without any agent engagement from the start.


Why your care strategy must consider issue complexity and urgency

Rachel Knaster
Apr 1
2 mins

A common trait of people working in technology is a desire to cleanly categorize information, data, issues, and so on, and to delineate between one bucket and another. We see this in how companies think about customer conversations: Should a conversation be automated? Yes or no? Does the customer need live engagement for the entire conversation, or something more asynchronous? But customer conversations aren’t actually so clear-cut, and the needs don’t stay consistent as conversations and customer journeys go on.

At ASAPP, we have developed a fairly unique way of thinking about conversations. Rather than relying on a single intent to determine how the entire conversation should be handled, let’s look at each turn of the conversation to better inform what the next step should be. Every request has different needs, which change considerably based on various factors.

The above graph provides a nice illustration of how we can think about the issues. Along the y-axis, you have more complex vs. simpler interactions. At the bottom, you have conversations that are well suited to full automation without any agent intervention. At the top, you have the opposite: conversations that benefit from having a skilled agent along for the ride. But those are extremes. Most conversations fall in between the two and require some human involvement and a bunch of automation. By thinking in very binary terms, automated or not automated, you lose out on all of the opportunities to reduce agent workload on a conversation by 20%, by 50%, by 75%. By treating each piece of a conversation as worthy of its own classification and diagnosis, you bring a lot of efficiency back into your business without risking frustrating your customer.

Now the x-axis. Here we’re thinking about how routine vs. how urgent the issue is. It’s easy to think, “We can serve customers asynchronously: they send an SMS, and we get back to them when we get back to them, just like customers are used to interacting with friends and family.” But that leaves out a very important part of the picture. While many conversations are routine and can benefit from more asynchronous interactions, allowing companies to balance workload across agents, there are cases where customers need urgent help, such as changing a flight that is about to take off or resolving a billing issue just before Super Bowl kickoff. In those cases, you don’t want to risk a customer not getting a response in time, especially when so many other conversations didn’t need that live resolution. Then, just as with complexity vs. simplicity, there are cases that fall in between: an initial response might need help from a live agent, such as cutting off access to a bank account in the case of fraud, but the follow-ups and resolutions are well suited to asynchronous communication.

Customer interactions require different levels of attention. From simple routine issues to urgent complex requests, organizations must be able to seamlessly support every type of need, in the most efficient way possible, using the right mix of agent and automation.

Rachel Knaster

In addition to the content of what the customer is asking about, it’s important to take in every parameter you know about them and the context surrounding their issue. This goes far beyond simple intent classification. In order to determine the type of service customers need, you need to look at the entire weight of their requests. The best way to think about it is along axes of complexity and urgency.

Based on where they fall on this graph, customer interactions require different levels of attention. From simple routine issues (C) to urgent complex requests (A), organizations must be able to seamlessly support every type of need, in the most efficient way possible, using the right mix of agent and automation.

Is the customer’s question simple to solve? Then let’s automate it.

Is it complicated? Then let’s connect them with our frontline and have those agents do what they do best.

Is the issue one that can wait for an answer and more asynchronous by nature? Then let’s treat it that way.

Or is a customer’s flight about to take off and they need help? Let’s immediately connect them with someone.

These are fundamental questions contact centers should consider with every incoming request. There’s no “one size fits all” when it comes to CX strategy. Every interaction requires a different approach so you can maximize throughput while keeping each customer satisfied.

Consider the graph above. Each quadrant represents a different category of request with its own unique considerations. In each case, the right mixture of live agent and AI, synchronous and asynchronous support can help solve the issue in the most optimal way possible. Here’s the ideal for each:

  1. Complex, urgent
     • Agent-based, synchronous
     • Low agent concurrency
     • Automate part of agent workload
     • Opportunity to mix voice and digital in same live conversation for faster resolution
  2. Complex, routine
     • Agent-based, asynchronous
     • Automate part of agent workload
     • High agent concurrency
     • Handoff to phone if required
  3. Simple, routine
     • Fully automated interaction
     • Low cost to serve
  4. Simple, urgent
     • Fully automated, with fast escalation to live agent
     • Complete history (context) of interaction required for agent
     • Medium-high agent concurrency
     • Automate part of agent workload
     • Opportunity to mix voice and digital in same live conversation for faster resolution
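
The four quadrants can be read as a small routing table. The sketch below is only an illustration: the `Plan` type, the `route` function, and the boolean inputs are hypothetical names, and how a real system decides that a request is complex or urgent is left out entirely. Values marked as assumptions are not stated in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    handler: str            # who handles the request
    mode: str               # "synchronous" or "asynchronous"
    agent_concurrency: str  # how many conversations an agent juggles


def route(is_complex: bool, is_urgent: bool) -> Plan:
    """Map a request's quadrant to the ideal handling described in the list."""
    if is_complex and is_urgent:
        return Plan("agent", "synchronous", "low")
    if is_complex:
        return Plan("agent", "asynchronous", "high")
    if is_urgent:
        # Assumption: urgent requests are served synchronously.
        return Plan("automation, fast escalation to agent", "synchronous", "medium-high")
    # Assumption: simple, routine requests need no agent and can wait.
    return Plan("automation", "asynchronous", "n/a")


# A flight about to take off: complex and urgent.
print(route(is_complex=True, is_urgent=True))
```

Because a conversation can move between quadrants as it evolves, a function like this would be called again whenever the assessment of complexity or urgency changes.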

While companies might prefer everything be automated or self-service, that’s not always the most efficient way to solve an issue. Of course, neither is having your agents occupied addressing routine tasks all day. What’s needed is the right balance between the two: AI enhancing human performance so agents can handle more tasks and fully concentrate on those that need it. This is where more sophisticated machine learning offers incredible value.

There is an opportunity for AI to assist in every interaction, whether it’s handling the entire request or just part of the workload. While typically considered most helpful for automating simple tasks, the right AI models will improve over time, learning from customer interactions to assist with increasingly complex issues.

A single conversation can also become simpler or more complex as it evolves, calling for changing levels of agent attention. For instance, now that the primary issue has been resolved, can the rest of this interaction be automated? Or has the issue escalated from automation to the need for an agent? Instant intent analysis provided by machine learning can help identify these occurrences to further optimize agent concurrency.

The truth is, sometimes the best thing is to have an agent live with just one customer, and sometimes it’s to have them handling multiple conversations. What’s important is for each organization to recognize the nuance and build flexible solutions that adapt to each situation, enhancing operational performance without ever compromising on a personalized and connected experience for customers.

R&D Innovations

How to Understand Different Levels of AI Systems

Michael Griffiths
Mar 11
2 mins

AI systems have additional considerations over traditional software. A key difference is in the maintenance cost. Most of the cost of an AI system happens after the code has been deployed. ML models degrade over time without ongoing investment in data and hyperparameter tuning.

The cost structure of an AI system is directly affected by its design decisions; the level of service and the improvement over time differ categorically across levels. Knowing the level of the AI system can help practitioners and customers predict how the system will change over time – whether it will continuously improve, remain the same, or even degrade.
Levels of AI Systems start at traditional software (Level 0) and progress up to fully Intelligent software (Level 4). Systems at Level 4 essentially maintain and improve on their own – they require negligible work. At ASAPP we call Level 4 AI Native®.

Moving up a level has trade-offs for practitioners and customers. For example, moving from Level 1 to Level 2 reduces ongoing data requirements and customization work, but introduces a self-reinforcing bias problem that could cause the system to degrade over time. Choosing to move up a level requires practitioners to recognize the new challenges, and the actions to take in designing an AI system.

While there are significant benefits in scalability (and typically performance/robustness/etc) in moving up levels, it’s important to say that most systems are best designed at Level 0 or Level 1. These levels are the most predictable: performance should remain roughly stable over time, and there are obvious mechanisms to improve performance (e.g. for Level 1, add more annotated training data).

AI Levels

Designing AI systems is different from traditional software development, because the behavior of the system is learned – and can potentially change over time once deployed. When practitioners build AI systems, it can be useful to talk about their “level”, just like SAE has levels for self-driving cars.

Michael Griffiths
Moving up a level has trade-offs for practitioners and customers. This requires practitioners to recognize the new challenges, and the actions to take in designing an AI system.

Michael Griffiths

Level 0: Deterministic

No required training data, no required testing data

Algorithms that involve no learning (e.g. adapting parameters to data) are at level zero.
The great benefit of Level 0 (traditional algorithms in computer science) is that they are very reliable and, if you solve the problem, can be shown to be the optimal solution. If you can solve a problem at Level 0 it’s hard to beat. In some respects, all algorithms – even simple ones like binary search – are “adaptive” to the data. We do not generally consider such algorithms to be “learning”. Learning involves memory – the system changing how it behaves in the future, based on what it’s learned in the past.

However, some problems defy a pre-specified algorithmic solution. The downside is that for problems that defy human understanding (either individually or in their sheer number), it can be difficult for a Level 0 system to perform well (e.g. speech to text, translation, image recognition, utterance suggestion).


Examples:

  • Luhn algorithm for credit card validation
  • Regex-based systems (e.g. simple redaction systems for credit card numbers)
  • Information retrieval algorithms like TF-IDF retrieval or BM25
  • Dictionary-based spell correction
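
To make the first item concrete, here is a minimal sketch of the Luhn checksum. It involves no learning at all: the same input always produces the same answer. The function name `luhn_valid` and the choice to ignore spaces and hyphens are assumptions made for this illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or any(ch not in "0123456789" for ch in cleaned):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))  # a commonly used valid test number
print(luhn_valid("79927398710"))  # same number with a wrong check digit
```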

Note: In some cases, there can be a small number of parameters to tune. For example, Elasticsearch provides the ability to modify BM25 parameters. We can regard these as tuning parameters, i.e. set and forget. This is a blurry line.
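
As a rough illustration of what such tuning parameters look like, the sketch below scores a single query term with a commonly cited form of the BM25 formula. The function name and argument names are hypothetical, and real search engines differ in details such as the exact IDF variant. The defaults `k1=1.2` and `b=0.75` are the values usually quoted as defaults.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term for one document (illustrative BM25 variant)."""
    # Rarer terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of a term saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)
```

Nothing here is fitted to data: `k1` and `b` are chosen once by a person and left alone, which is why the system still sits at Level 0.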

Level 1: Learned
Static training data, static testing data

Systems where you train the model in an offline setting and deploy to production with “frozen” weights. There may be an updating cadence to the model (e.g. adding more annotated data), but the environment the model operates in does not affect the model.

The benefit of level 1 is that you can learn and deploy any function at the modest cost of some training data. This is a great place to experiment with different types of solutions. And, for problems with common elements (e.g. speech recognition) you can benefit from diminishing marginal costs.

The downside is that the cost of customization is linear in the number of use cases: you need to curate training data for each one. And each use case can change over time, so you need to continuously add annotations to preserve performance. This cost can be hard to bear.


Examples:

  • Custom text classification models
  • Speech to text (acoustic model)
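
A toy version of the Level 1 pattern is sketched below: a model is trained offline on annotated examples, then deployed with parameters that production traffic cannot change. The word-count classifier, the function names, and the two training examples are all made up for illustration and are far simpler than a real text classifier.

```python
from collections import Counter
from types import MappingProxyType


def train(examples):
    """Offline training: count how often each word appears under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    # Freeze the learned parameters: callers can read them but not modify them.
    return MappingProxyType(
        {label: MappingProxyType(dict(c)) for label, c in counts.items()}
    )


def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label].get(w, 0) for w in words))


# Hypothetical annotated data, curated by people before deployment.
model = train([
    ("refund my bill", "billing"),
    ("reset my password", "account"),
])
print(predict(model, "question about a bill refund"))
```

Improving this system means going back offline, adding annotated examples, and calling `train` again. Nothing the deployed model sees changes its behavior.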

Level 2: Self-learning

Dynamic + static training data, static testing data

Systems that use training data generated from the system for the model to improve. In some cases, the data generation is independent of the model (so we expect increasing model performance over time as more data is added); in other cases, the model intervening can reinforce model biases and performance can get worse over time. To eliminate the chance of reinforcing biases, practitioners need to evaluate new models on static (potentially annotated) data sets.

Level 2 is great because performance seems to improve over time for free. The downside is that, left unattended, the system can get worse – it may not be consistent in getting better with more data. The other limitation is that some systems at Level 2 might have limited capacity to improve as they essentially feed on themselves (generating their own training data); addressing this bias can be challenging.


Examples:

  • Naive spam filters
  • Common speech to text models (language model)
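
The safeguard described above, evaluating each retrained model on a static data set, can be sketched as a simple promotion gate. Everything in this example is hypothetical: the models are plain functions over a link count, the evaluation set is invented, and a real gate would use a much larger annotated set and more than one metric.

```python
def accuracy(model, dataset):
    """Fraction of (input, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)


def choose_model(current, candidate, static_eval_set):
    """Deploy the candidate only if it is no worse on the static eval set."""
    if accuracy(candidate, static_eval_set) >= accuracy(current, static_eval_set):
        return candidate
    return current


# Hypothetical spam filters: flag a message by its number of links.
def current_model(n_links):
    return n_links > 3


def drifted_model(n_links):
    # Imagine this threshold came from retraining on the system's own output.
    return n_links > 10


# Static, human-annotated examples: (number of links, is_spam).
STATIC_EVAL = [(0, False), (2, False), (5, True), (12, True)]

deployed = choose_model(current_model, drifted_model, STATIC_EVAL)
print(deployed.__name__)
```

The static set acts as a fixed reference point: a model that has drifted by learning from its own output scores worse on it and is not promoted.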

Level 3: Autonomous (or self-correcting)

Dynamic training data, dynamic test data

Systems that both alter human behavior (e.g. recommend an action and let the user opt-in) and learn directly from that behavior, including how the system’s choice changes the user behavior. Moving from Level 2 to 3 potentially represents a big increase in system reliability and total achievable performance.

Level 3 is great because it can consistently get better over time. However, it is more complex: it might require truly staggering amounts of data, or a very carefully designed setup, to do better than simpler systems; its ability to adapt to the environment also makes it very hard to debug. It is also possible to have truly catastrophic feedback loops. For example, a human corrects an email spam filter – however, because the human can only ever correct misclassifications that the system made, it learns that all its predictions are wrong and inverts its own predictions.
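
The spam filter example can be made concrete with a few lines of arithmetic. In this sketch, with invented data and hypothetical function names, the filter is right 90% of the time, but the feedback log only contains the cases a human corrected. Measured on that log alone, the filter looks wrong every single time.

```python
def collect_corrections(predictions, truths):
    """Keep only the cases a human would correct: the misclassifications."""
    return [(p, t) for p, t in zip(predictions, truths) if p != t]


def agreement_rate(pairs):
    """Fraction of (prediction, label) pairs where the two match."""
    return sum(p == t for p, t in pairs) / len(pairs)


# Invented data: ten emails, one of which the filter gets wrong.
truths = ["spam", "ham"] * 5
predictions = list(truths)
predictions[0] = "ham"  # the single mistake

everything = list(zip(predictions, truths))
corrections = collect_corrections(predictions, truths)

print(agreement_rate(everything))   # accuracy over all traffic
print(agreement_rate(corrections))  # accuracy as seen through corrections only
```

A system that trains on `corrections` alone receives no evidence of its correct predictions, which is the selection effect behind the inverted-predictions failure described above.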

Level 4: Intelligent (or globally optimizing)

Dynamic training data, dynamic test data, dynamic goal

Systems that both dynamically interact with an environment and globally optimize towards some set of downstream objectives, e.g. facilitating an agent while optimizing for AHT and CSAT, or optimizing directly for profit. For example, an AutoCompose system that optimizes for the best series of clicks to optimize the conversation.

Level 4 can be very attractive. However, it is not always obvious how to get there, and unless carefully designed, these systems can optimize towards degenerate solutions. Aiming them at the right problem, shaping the reward, and auditing their behavior are large and non-trivial tasks.
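
A small sketch shows how a poorly shaped reward leads to a degenerate solution. The two policies, their metrics, and the reward weights are all invented for illustration; they are not measurements from any real system.

```python
def best_policy(policies, reward):
    """Return the name of the policy with the highest reward."""
    return max(policies, key=lambda name: reward(policies[name]))


# Hypothetical outcomes: average handle time (minutes) and CSAT (1 to 5).
POLICIES = {
    "resolve_issue": {"aht_minutes": 8.0, "csat": 4.6},
    "end_chat_immediately": {"aht_minutes": 0.5, "csat": 1.2},
}


def aht_only(metrics):
    # Rewards short conversations and nothing else.
    return -metrics["aht_minutes"]


def balanced(metrics):
    # Assumed weighting: one CSAT point is worth ten minutes of handle time.
    return metrics["csat"] - 0.1 * metrics["aht_minutes"]


print(best_policy(POLICIES, aht_only))
print(best_policy(POLICIES, balanced))
```

Optimizing handle time alone picks the policy that ends every chat at once. Adding CSAT to the reward changes the winner, but choosing the weights is itself part of the reward-shaping work mentioned above.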

Why consider levels?

Designing and building AI systems is difficult. A core part of that difficulty is understanding how they change over time (or don’t change!): how the performance, and maintenance cost, of the system will develop.

In general, there is increasing value as you move up levels, e.g. one goal might be to move a system operating at Level 1 to be at Level 2 – but complexity (and cost) of system build also increases as levels go up. It can make a lot of sense to start with a novel feature at a “low” level, where the system behavior is well understood, and progressively increase the level – as understanding the failure cases of the system becomes more difficult as the level increases.

The focus should be on learning about the problem and the solution space. Lower levels are more consistent and can be much better avenues to explore possible solutions than higher levels, whose cost and variability in performance can be large hindrances.
This set of levels provides some core breakpoints for how different AI systems can behave. Employing these levels – and making trade-offs between levels – can help provide a shorthand for differences post-deployment.
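
As a compact summary, the levels and their data regimes can be written down as a lookup table. The enum and helper names below are hypothetical; the regimes are taken from the level descriptions in this article.

```python
from enum import IntEnum


class AILevel(IntEnum):
    DETERMINISTIC = 0
    LEARNED = 1
    SELF_LEARNING = 2
    AUTONOMOUS = 3
    INTELLIGENT = 4


# Data regime per level, as stated under each level heading.
DATA_REGIME = {
    AILevel.DETERMINISTIC: {"training": "none required", "testing": "none required"},
    AILevel.LEARNED: {"training": "static", "testing": "static"},
    AILevel.SELF_LEARNING: {"training": "dynamic + static", "testing": "static"},
    AILevel.AUTONOMOUS: {"training": "dynamic", "testing": "dynamic"},
    AILevel.INTELLIGENT: {"training": "dynamic", "testing": "dynamic", "goal": "dynamic"},
}


def changes_after_deployment(level: AILevel) -> bool:
    """True if the system's behavior can change based on production data."""
    return "dynamic" in DATA_REGIME[level]["training"]


for level in AILevel:
    print(level.value, level.name, changes_after_deployment(level))
```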
