Measuring Success

How to start assessing and improving the way your agents use their tools

Adrian Botta
May 19
2 mins

Customer care leaders tasked with improving customer satisfaction while also reducing cost often find it challenging to know where to begin. It’s hard to know what types of problems are ideal for self-serve, where the biggest bottlenecks exist, what workflows could be streamlined, and how to provide both the right training and targeted feedback to a large population of agents.

We talk with stakeholders from many companies who are charged with revamping the tools agents have at their disposal to improve agent effectiveness and decrease negative outcomes like call-backs and handle time.

Some of the first questions they ask are:

  • Where are my biggest problem areas?
  • What can I automate for self-service?
  • What needs to be streamlined?
  • Where are there gaps in my agents’ resources?
  • Are they using the systems and the tools we’re investing in for the problems they were designed to help solve?
  • What are my best agents doing?
  • Which agents need to be trained to use their tools more effectively? And on which tools?

These questions require an understanding of both the tools that agents use and the entire landscape of customer problems. This is the first blog in a series that details how an ASAPP AI Service, JourneyInsight, ties together what agents are saying and doing with key outcomes to answer these questions.

Using JourneyInsight, customer care leaders can make data-driven, impactful improvements to agent processes and agent training.

Adrian Botta

Most customers start with our diagnostic reports to identify problem areas and help them prioritize improvements. They then use our more detailed reports that tie agent behavior with outcomes and compare performance across agent groups to drive impactful changes.

These diagnostic reports provide visibility into, and context behind, KPIs that leaders have not had before.

Our Time-on-Tools report captures how much time agents spend on each of their tools for each problem type or intent. This enables a user to:

  • Compare tool usage for different intents,
  • Understand the distribution of time spent on each tool,
  • Compare times across different agent groups.

With this report, it’s easy to see which problem types still have agents relying on legacy tools, or that the 20% least-tenured agents spend 30% more time on the payments page than their colleagues do for a billing-related question.
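As a rough illustration of the aggregation behind a report like this, here is a minimal sketch. It assumes a hypothetical event log where each row records the intent, the agent group, the tool in focus, and the seconds spent there. The field names and sample values are made up for illustration and are not JourneyInsight's actual schema or output.

```python
from collections import defaultdict

# Hypothetical event log rows: (intent, agent_group, tool, seconds_in_focus).
events = [
    ("billing", "new", "payments_page", 130.0),
    ("billing", "new", "knowledge_base", 20.0),
    ("billing", "tenured", "payments_page", 100.0),
    ("billing", "tenured", "knowledge_base", 25.0),
]

def time_on_tools(rows):
    """Total seconds per (intent, agent_group, tool)."""
    totals = defaultdict(float)
    for intent, group, tool, seconds in rows:
        totals[(intent, group, tool)] += seconds
    return dict(totals)

def tool_share(totals, intent, group):
    """Fraction of time spent on each tool for one intent and agent group."""
    selected = {
        tool: secs
        for (i, g, tool), secs in totals.items()
        if i == intent and g == group
    }
    grand_total = sum(selected.values())
    if grand_total == 0:
        return {}
    return {tool: secs / grand_total for tool, secs in selected.items()}

totals = time_on_tools(events)

# Relative extra time newer agents spend on the payments page for billing.
new_secs = totals[("billing", "new", "payments_page")]
tenured_secs = totals[("billing", "tenured", "payments_page")]
extra = new_secs / tenured_secs - 1.0

print(f"Newer agents spend {extra:.0%} more time on the payments page")
print(tool_share(totals, "billing", "tenured"))
```

A real report would normalize by the number of issues per group before comparing; the sketch compares raw totals to keep the idea visible.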

Our Agent Effort report captures the intensity of the average issue for each problem type or intent. This enables a user to:

  • Compare how many systems are used,
  • See how frequently agents switch between different tools,
  • Understand how they have to interact with those tools to solve the customer’s problem.

With this report, it’s easy to identify which problem types require the most effort to resolve, how the best agents interact with their tools, and how each agent stacks up against the best agents.
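Two of the effort signals listed above, the number of systems used and how often agents switch between tools, can be sketched from an ordered log of which tool had focus during an issue. The log format and tool names here are assumptions for illustration only.

```python
def effort_metrics(tool_sequence):
    """Count distinct tools and tool switches in one issue's ordered focus log."""
    switches = sum(
        1 for prev, cur in zip(tool_sequence, tool_sequence[1:]) if prev != cur
    )
    return {"systems_used": len(set(tool_sequence)), "switches": switches}

# Hypothetical ordered log of which tool had focus during one issue.
issue = ["crm", "crm", "billing", "crm", "knowledge_base", "crm"]
metrics = effort_metrics(issue)
print(metrics)
```

Averaging these per-issue numbers by intent, and then by agent group, gives the kind of comparison the report describes.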

These examples illustrate some of the ways our customers have used these reports to answer key questions.

What can I automate for self-service?

When looking for intents to address with self-serve capabilities, it is critical to know how much volume could be reduced and the cost of implementation. The cost can be informed by how complex the problem-solving workflows are and which systems will need to be integrated with.

Our diagnostic reports for one customer for a particular intent showed that:

  • Agents use one tool for 86% of the issue and agents switch tabs infrequently during the issue
  • Agents copy and paste information from that primary tool to the chat on average 3.2 times over the course of the issue
  • The customer usually has all of the information that an agent needs to fill out the relevant forms
  • Agents consult the knowledge base in only 2.8% of the issues within that intent
  • There is very little variation in the way the agents solve that problem

All of these data points indicate that this intent is simple, consistent, and requires relaying information from one existing system to a customer. This makes it a good candidate to build into a self-serve virtual agent flow.

Our more detailed process discovery reports can identify multiple workflows per intent and outline each workflow. They also provide additional details and statistics needed to determine whether the workflow is ideal for automation.

Where are there gaps in my agents’ resources?

Correct resource usage is generally determined by the context of the problem and the type or level of agent using the resource.

Our diagnostic reports for one customer for a particular intent showed that:

  • A knowledge base article that was written to address the majority of problems for a certain intent is being accessed during only 4% of the issues within that intent
  • Agents are spending 8% of their time during that intent reaching out to their colleagues through an internal web-based messaging tool (e.g. Google Chat)
  • Agents access an external search tool (e.g. Google search) 19% of the time to address this intent.
  • There are very inconsistent outcomes, such as AHT ranging from 2 minutes to 38 minutes and varying callback rates by agent group

These data points suggest that this intent is harder to solve than expected and that resources need to be updated to provide agents with the answers they need. It is also possible that agents were not aware of the resources that have been designed to help solve that problem.

We followed up to take a deeper look with our more detailed outcome drivers report. It showed that when agents use the knowledge base article written for that intent, callback rates are lower. This indicates that the article likely does, in fact, help agents resolve the issue.

In subsequent posts, we’ll describe how we drive more value using predictive models and sequence mining to help identify root causes of negative outcomes and where newer agents are deviating from more tenured agents.


To measure the performance of Conversational AI, we need more strict, better quality benchmarks

Suwon Shon
May 13
2 mins

Introducing the Spoken Language Understanding Evaluation (SLUE) benchmark suite

Progress on speech processing has benefited from shared datasets and benchmarks. Historically, these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. However, “higher-level” spoken language understanding (SLU) tasks have received less attention and resources in the speech community. There are numerous tasks at varying linguistic levels that have been benchmarked extensively for text input by the natural language processing (NLP) community – named entity recognition, parsing, sentiment analysis, entailment, summarization, and so on – but they have not been as thoroughly addressed for speech input.

Consequently, SLU sits at the intersection of the speech and NLP fields but has not been addressed seriously from either side. We think the biggest reason for this disconnect is the lack of an appropriate benchmark dataset. This lack makes performance comparisons very difficult and raises the barriers to entry into the field. A high-quality benchmark would allow both the speech and NLP communities to address open research questions about SLU, such as which tasks can be addressed well by pipeline ASR+NLP approaches and which applications benefit from end-to-end or joint modeling, and, for the latter kind of tasks, how best to extract the needed speech information.

For conversational AI to advance, the broader scientific community must be able to work together and explore with easily accessible state-of-the-art baselines for fair performance comparisons.

Suwon Shon, PhD

We believe that for conversational AI to advance, the broader scientific community must be able to work together and explore with easily accessible state-of-the-art baselines for fair performance comparisons. The current lack of such benchmarks is our main motivation for establishing the SLUE benchmark and its suite.

The first phase of SLUE

We are launching the first benchmark, which covers ASR, NER, and sentiment analysis (SA), with a particular emphasis on low-resource SLU. For this benchmark, we contribute the following:

  1. New annotation of publicly available, natural speech data for training and evaluation on new tasks, specifically named entity recognition (NER) and sentiment analysis (SA), as well as new text transcriptions for training and evaluating ASR systems on the same data.
  2. A benchmark suite including a toolkit for reproducing state-of-the-art baseline models and evaluation, the annotated data, website, and leaderboard.
  3. A variety of baseline models that can be reproduced to measure the state of existing models on these new tasks.
  4. A small labeled dataset to encourage new algorithms and findings for low-resource SLU tasks.

SLUE covers two SLU tasks (NER and SA) plus ASR. Every evaluation in this benchmark starts with speech as input, whether the system is a pipeline (ASR followed by an NLP model) or an end-to-end model that predicts results directly from speech.

The SLUE benchmark suite covers downloading the dataset, training state-of-the-art baselines, and evaluating against high-quality annotations. The website hosts an online leaderboard for tracking up-to-date performance. We strongly believe the SLUE benchmark makes SLU tasks much more accessible, so researchers can focus on problem-solving.

Current leaderboard of the SLUE benchmark

Why it matters

Recent SLU-related benchmarks have been proposed with similar motivations to SLUE. However, those benchmarks are not as comprehensive as SLUE, for the following reasons:

  1. Some of their tasks already achieve nearly perfect performance (SUPERB, ATIS), which leaves too little headroom to discriminate between different approaches.
  2. Some datasets consist of artificial (synthesized) rather than natural speech (SLURP), which doesn’t recreate real-world conditions.
  3. Some provide audio only for evaluation, with no training audio available (ASR-GLUE).
  4. Some use short speech commands rather than longer conversational speech (SLURP, FSC).
  5. Some have license constraints limiting their industry use (Switchboard NXT, FSC).

SLUE provides a comprehensive comparison between models without those shortcomings. We expect the SLUE benchmark to:

  1. Track research progress on multiple SLU tasks,
  2. Facilitate the development of pre-trained representations by providing fine-tuning and eval sets for a variety of SLU tasks,
  3. Foster the open exchange of research by focusing on freely available datasets that all academic and industrial groups can easily use.

Motivated by the growing interest in SLU tasks and recent progress on pre-trained representations, we have proposed a new benchmark suite consisting of newly annotated fine-tuning and evaluation sets, and have provided annotations and baselines for new NER, sentiment, and ASR evaluations. For the initial study of the SLUE benchmark, we evaluated numerous baseline systems using current state-of-the-art speech and NLP models.

This work is open to all researchers in the multidisciplinary community. We welcome similar research efforts focused on low-resource SLU, so we can continue to expand this benchmark suite with more tests and data. To contribute or expand on our open-source dataset, please email or get in touch with us at

Additional Resources

  1. Attend our ICASSP 2022 session
     • Presentation Time: Thu, 12 May, 08:00 – 08:45 New York Time (UTC -4)
  2. Attend our Interspeech 2022 special session “low-resource SLU”
     • September 18-22, 2022, Incheon, South Korea
  3. Paper
  4. SLUE Benchmark Suite (Toolkit and dataset)
  5. Website and leaderboard
  6. Email me or get in touch with us @ASAPP.
Contact Center
Measuring Success

A contact center case study about call summarization strategies

Gonzalo Chebi
Apr 14
2 mins

All agents at a contact center are typically required to write summary – or disposition – notes for each conversation. These notes are intended to be used for several purposes. They provide context if the issue needs to be revisited in follow up calls. This avoids the need for the customer to repeat the problem and saves the agent time. Also, supervisors can use these notes to see how often certain situations arise and identify coaching opportunities. Good disposition notes will include the customer’s contact reason, key actions taken to solve it and the conversation outcome.

The time required to take these notes is, on average, 10% of the actual call duration and agents may only capture some aspects of the conversation. This is why many contact center leaders are looking for ways to reduce the time spent writing these notes and increase their quality.

Findings from a large enterprise contact center

Like most contact centers, this company had its agents write all notes at the end of each conversation. Aiming to increase agents’ utilization (i.e. the proportion of time agents are talking with a customer), the company shifted to having agents write the notes during, not after, the conversation. Agents are encouraged to write these notes in “natural pauses” in the conversation. This way, agents significantly reduce the time between the end of one conversation and the start of the next.

In reviewing call data, we learned that in a large proportion of voice calls, these natural pauses do not occur very often. To understand this, for each conversation we first identify the time intervals in which the customer or the agent is talking, as shown in Figure 1 below. Based on these customer and agent turn intervals, we can identify the pauses in the conversation. For this analysis, we keep only the pauses that last at least 10 seconds.

Figure 1: intervals where the customer and the agents are talking, along with the pauses of at least 10 seconds.
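The pause detection described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes each call is available as a list of (start, end) talk intervals in seconds for both speakers, and it only counts gaps between turns, ignoring any silence before the first turn or after the last one.

```python
def find_pauses(turns, min_pause=10.0):
    """Return (start, end) gaps of at least min_pause seconds between turns.

    turns: (start_sec, end_sec) intervals for every customer or agent turn.
    """
    pauses = []
    covered_until = None  # latest end time of any turn seen so far
    for start, end in sorted(turns):
        if covered_until is not None and start - covered_until >= min_pause:
            pauses.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return pauses

def pauses_per_minute(turns, min_pause=10.0):
    """Pauses per minute, measured from the first turn start to the last turn end."""
    if not turns:
        return 0.0
    call_start = min(start for start, _ in turns)
    call_end = max(end for _, end in turns)
    minutes = (call_end - call_start) / 60.0
    if minutes <= 0:
        return 0.0
    return len(find_pauses(turns, min_pause)) / minutes

# Hypothetical turn intervals (seconds) for one call, both speakers combined.
turns = [(0, 12), (13, 30), (45, 70), (72, 90), (104, 120)]
print(find_pauses(turns))
print(pauses_per_minute(turns))
```

Tracking the latest end time rather than the previous turn's end keeps overlapping speech from creating false pauses.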

As the histogram in Figure 2 shows, we estimate that half of the calls have fewer than one pause every two minutes and 13% of the calls have no pauses at all. Moreover, during most of those pauses, agents are busy actively working on the issue (looking for information, filling in forms, etc.), so taking notes is not possible.

Figure 2: Histogram of the number of pauses per minute in voice calls. Here, a pause is defined as a time interval of at least 10 seconds in which neither the rep nor the customer is talking.

This means that when an agent takes notes in the middle of the conversation they are usually creating an artificial pause. In other words, they are transferring the time it would have taken to take the notes at the end, to more time spent on the call with each customer. Moreover, when they don’t finish notes during the call, note-taking for that call spills over into the next call which significantly increases the complexity for the agent.

Having agents take notes during the conversation does not improve efficiency and may harm the overall customer experience.

Gonzalo Chebi, PhD

On the customer side, pauses in the middle of the conversation likely have negative consequences. Our data consistently shows that conversations with longer response times are associated with a lower Customer Satisfaction (CSAT) score, as we show in Figure 3 (the CSAT is on a scale from 1 to 5 here). In addition to waiting through pauses, the overall time the customer (and the agent) spend on the call is longer.

Figure 3: We calculated the longest pause in each conversation and bucketed this variable with a bucket size of 5 seconds. Each point represents one bucket: the x-axis corresponds to the lower end of the bucket range and the y-axis corresponds to the average CSAT score for all the conversations in that bucket.
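The bucketing described in the Figure 3 caption can be sketched like this. The sample pairs are invented for illustration and are not data from the case study.

```python
from collections import defaultdict

def csat_by_longest_pause(conversations, bucket_size=5):
    """Average CSAT per bucket, keyed by the lower end of each bucket range.

    conversations: (longest_pause_seconds, csat_score) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for longest_pause, csat in conversations:
        bucket = int(longest_pause // bucket_size) * bucket_size
        sums[bucket] += csat
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sorted(sums)}

# Hypothetical (longest pause in seconds, CSAT on a 1-5 scale) pairs.
sample = [(3.0, 5), (4.5, 4), (12.0, 4), (14.9, 3), (31.0, 2)]
curve = csat_by_longest_pause(sample)
print(curve)
```

Each key is the x-value of one point in the plot and each value is its y-value.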

The value of automating call summaries

We already showed that taking notes during the call does not improve agent efficiency and may harm the overall customer experience. On the other hand, automating conversation summaries can be a way to reduce or completely eliminate dispositioning time for the agents as well as increase the general quality of the summaries.

The customer in this case study is initially making the automated summaries visible in their agent desk, enabling the agents to review and edit.

This has significantly reduced the time agents devote to this task.

As confidence in the AutoSummary model grows, companies may opt to remove manual reviews from agents’ task lists completely and take the additional efficiency gains available. Other customers bypass this step and use AutoSummary without any agent engagement from the start.

Measuring Success

Why your care strategy must consider issue complexity and urgency

Rachel Knaster
Apr 1
2 mins

A common trait of people working in technology is a desire to cleanly categorize information, data, issues, and so on: to be able to delineate between one bucket and another. We see this in how companies think about customer conversations. Should a conversation be automated, yes or no? Does the customer need a live engagement for the entire conversation rather than a more asynchronous one, yes or no? But customer conversations aren’t actually so clear-cut, and the needs don’t stay consistent as conversations and customer journeys go on.

At ASAPP, we have developed a fairly unique way of thinking about conversations. Rather than relying on a single intent to determine how the entire conversation should be handled, let’s look at each turn of the conversation to better inform what the next step should be. Every request has different needs, which change considerably based on various factors.

The above graph illustrates how we can think about the issues. Along the y-axis, you have more complex vs. simpler interactions. At the bottom are conversations that are well suited to full automation without any agent intervention. At the top is the opposite: conversations that benefit from having a skilled agent along for the ride. But those are the extremes. Most conversations fall in between the two and require some human involvement and a lot of automation. By thinking in binary terms, automated or not automated, you lose out on all of the opportunities to reduce agent workload on a conversation by 20%, by 50%, by 75%. By treating each piece of a conversation as worthy of its own classification and diagnosis, you bring a lot of efficiency back into your business without risking frustrating your customer.

Now the x-axis: here we’re thinking about how routine vs. how urgent the issue is. It’s easy to think, “We can serve customers asynchronously. They send an SMS, and we get back to them when we get back to them, just as customers are used to interacting with friends and family.” But that leaves out a very important part of the picture. Many conversations are routine and can benefit from more asynchronous interactions, which let companies balance the workload across agents. But there are cases where customers need urgent help, such as changing a flight that is about to take off or resolving a billing issue just before Super Bowl kickoff. In those cases, you don’t want to risk a customer not getting a response in time, especially when so many other conversations didn’t need that live resolution. Then, just as with complexity vs. simplicity, there are cases in between: the initial response might need help from a live agent (cutting off access to a bank account in the case of fraud, for example), but the follow-ups and resolution are well suited to asynchronous communication.

Customer interactions require different levels of attention. From simple routine issues to urgent complex requests, organizations must be able to seamlessly support every type of need, in the most efficient way possible, using the right mix of agent and automation.

Rachel Knaster

In addition to the content of what the customer is asking about, it’s important to take in every parameter you know about them and the context surrounding their issue. This goes far beyond simple intent classification. In order to determine the type of service customers need, you need to look at the entire weight of their requests. The best way to think about it is along axes of complexity and urgency.

Based on where they fall on this graph, customer interactions require different levels of attention. From simple routine issues (C) to urgent complex requests (A), organizations must be able to seamlessly support every type of need, in the most efficient way possible, using the right mix of agent and automation.

Is the customer’s question simple to solve? Then let’s automate it.

Is it complicated? Then let’s connect them with our frontline and have those agents do what they do best.

Is the issue one that can wait for an answer and more asynchronous by nature? Then let’s treat it that way.

Or is a customer’s flight about to take off and they need help? Let’s immediately connect them with someone.

These are fundamental questions contact centers should consider with every incoming request. There’s no “one size fits all” when it comes to CX strategy. Every interaction requires a different approach, so you can maximize throughput while keeping each customer satisfied.

Consider the graph above. Each quadrant represents a different category of request with its own unique considerations. In each case, the right mixture of live agent and AI, synchronous and asynchronous support can help solve the issue in the most optimal way possible. Here’s the ideal for each:

  1. Complex, urgent
     • Agent-based, synchronous
     • Low agent concurrency
     • Automate part of agent workload
     • Opportunity to mix voice and digital in same live conversation for faster resolution
  2. Complex, routine
     • Agent-based, asynchronous
     • Automate part of agent workload
     • High agent concurrency
     • Handoff to phone if required
  3. Simple, routine
     • Fully automated interaction
     • Low cost to serve
  4. Simple, urgent
     • Fully automated, with fast escalation to live agent
     • Complete history (context) of interaction required for agent
     • Medium-high agent concurrency
     • Automate part of agent workload
     • Opportunity to mix voice and digital in same live conversation for faster resolution
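To make the quadrant idea concrete, here is a small sketch that maps a request to one of the four categories above. The 0-to-1 complexity and urgency scores and the 0.5 threshold are assumptions made for illustration; in practice those signals would come from models and business rules, and would be re-evaluated at each turn of the conversation.

```python
# Handling summaries taken from the quadrant list above.
PLAYBOOK = {
    ("complex", "urgent"): "agent-based, synchronous, low agent concurrency",
    ("complex", "routine"): "agent-based, asynchronous, high agent concurrency",
    ("simple", "routine"): "fully automated interaction",
    ("simple", "urgent"): "fully automated, with fast escalation to live agent",
}

def quadrant(complexity, urgency, threshold=0.5):
    """Map hypothetical 0-1 scores to a (complexity, urgency) quadrant."""
    return (
        "complex" if complexity >= threshold else "simple",
        "urgent" if urgency >= threshold else "routine",
    )

def handling_for(complexity, urgency):
    """Look up the suggested handling for the current turn."""
    return PLAYBOOK[quadrant(complexity, urgency)]

# The same conversation can move between quadrants as it evolves.
print(handling_for(0.9, 0.8))  # e.g. suspected fraud on a bank account
print(handling_for(0.2, 0.1))  # e.g. a follow-up confirmation
```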

While companies might prefer everything be automated or self service, that’s not always the most efficient way to solve an issue. Of course, neither is having your agents occupied addressing routine tasks all day. What’s needed is the right balance between the two—AI enhancing human performance so agents can handle more tasks and fully concentrate on those that need it. This is where more sophisticated machine learning offers incredible value.

There is an opportunity for AI to assist in every interaction, whether it’s handling the entire request or just part of the workload. While typically considered most helpful for automating simple tasks, the right AI models will improve over time, learning from customer interactions to assist with increasingly complex issues.

A single conversation can also become more simple or complex as it evolves, calling for changing levels of agent attention. For instance, now that the primary issue has been resolved, can the rest of this interaction be automated? Or has the issue escalated from automation to the need for an agent? Instant intent analysis provided by machine learning can help identify these occurrences to further optimize agent concurrency.

The truth is, sometimes the best thing is to have an agent live with just one customer, and sometimes it’s to have them handling multiple conversations. What’s important is for each organization to recognize the nuance and to build flexible solutions that adapt for the best outcomes to ensure operational performance is being enhanced, while never compromising on a personalized and connected experience for customers.

R&D Innovations

How to Understand Different Levels of AI Systems

Michael Griffiths
Mar 11
2 mins

AI systems have additional considerations over traditional software. A key difference is in the maintenance cost. Most of the cost of an AI system happens after the code has been deployed. ML models degrade over time without ongoing investment in data and hyperparameter tuning.

The cost structure of an AI system is directly affected by its design decisions: the level of service and the improvement over time are categorically different across levels. Knowing the level of an AI system can help practitioners and customers predict how the system will change over time: whether it will continuously improve, remain the same, or even degrade.

Levels of AI Systems start at traditional software (Level 0) and progress up to fully Intelligent software (Level 4). Systems at Level 4 essentially maintain and improve on their own; they require negligible work. At ASAPP we call Level 4 AI Native®.

Moving up a level has trade-offs for practitioners and customers. For example, moving from Level 1 to Level 2 reduces ongoing data requirements and customization work, but introduces a self-reinforcing bias problem that could cause the system to degrade over time. Choosing to move up a level requires practitioners to recognize the new challenges, and the actions to take in designing an AI system.

While there are significant benefits in scalability (and typically performance/robustness/etc) in moving up levels, it’s important to say that most systems are best designed at Level 0 or Level 1. These levels are the most predictable: performance should remain roughly stable over time, and there are obvious mechanisms to improve performance (e.g. for Level 1, add more annotated training data).

AI Levels

Designing AI systems is different from traditional software development, because the behavior of the system is learned – and can potentially change over time once deployed. When practitioners build AI systems, it can be useful to talk about their “level”, just like SAE has levels for self-driving cars.

Moving up a level has trade-offs for practitioners and customers. This requires practitioners to recognize the new challenges, and the actions to take in designing an AI system

Michael Griffiths

Level 0: Deterministic

No required training data, no required testing data

Algorithms that involve no learning (e.g. adapting parameters to data) are at Level 0.

The great benefit of Level 0 (traditional algorithms in computer science) is that they are very reliable and, if you solve the problem, can be shown to be the optimal solution. If you can solve a problem at Level 0, it’s hard to beat. In some respects, all algorithms, even sorting and searching algorithms (like binary search), are “adaptive” to the data. We do not generally consider such algorithms to be “learning”. Learning involves memory: the system changing how it behaves in the future, based on what it has learned in the past.

The downside is that some problems defy a pre-specified algorithmic solution. For problems that defy human understanding (either individually or because of their sheer number), it can be difficult to perform well at this level (e.g. speech to text, translation, image recognition, utterance suggestion, etc.).


Examples:

  • Luhn algorithm for credit card validation.
  • Regex-based systems (e.g. simple redaction systems for credit card numbers).
  • Information retrieval algorithms like TF-IDF retrieval or BM25.
  • Dictionary-based spell correction.

Note: In some cases, there can be a small number of parameters to tune. For example, Elasticsearch provides the ability to modify BM25 parameters. We can regard these as tuning parameters, i.e. set and forget. This is a blurry line.
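As one concrete Level 0 example, here is a short sketch of the Luhn checksum. Nothing is learned from data: the same input always gives the same answer. The function name and the choice to ignore spaces and other separators are ours.

```python
def luhn_is_valid(number: str) -> bool:
    """Check a card-style number with the Luhn checksum.

    Characters other than 0-9 (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_is_valid("4111 1111 1111 1111"))  # a commonly used test number
print(luhn_is_valid("4111 1111 1111 1112"))
```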

Level 1: Learned

Static training data, static testing data

Systems where you train the model in an offline setting and deploy to production with “frozen” weights. There may be an updating cadence to the model (e.g. adding more annotated data), but the environment the model operates in does not affect the model.

The benefit of level 1 is that you can learn and deploy any function at the modest cost of some training data. This is a great place to experiment with different types of solutions. And, for problems with common elements (e.g. speech recognition) you can benefit from diminishing marginal costs.

The downside is that the cost of customization is linear in the number of use cases: you need to curate training data for each one. And a use case can change over time, so you need to continuously add annotations to preserve performance. This cost can be hard to bear.


Examples:

  • Custom text classification models
  • Speech to text (acoustic model)

Level 2: Self-learning

Dynamic + static training data, static testing data

Systems that use training data generated from the system for the model to improve. In some cases, the data generation is independent of the model (so we expect increasing model performance over time as more data is added); in other cases, the model intervening can reinforce model biases and performance can get worse over time. To eliminate the chance of reinforcing biases, practitioners need to evaluate new models on static (potentially annotated) data sets.

Level 2 is great because performance seems to improve over time for free. The downside is that, left unattended, the system can get worse – it may not be consistent in getting better with more data. The other limitation is that some systems at level two might have limited capacity to improve as they essentially feed on themselves (generating their own training data); addressing this bias can be challenging.
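One way to guard a Level 2 system against self-reinforcing bias is to gate every retrained model on a fixed, annotated test set, as suggested above. The sketch below is only an illustration of that gate: the toy "models" are plain functions, and the names and the accuracy metric are assumptions rather than a description of any production system.

```python
def accuracy(model, examples):
    """Share of (text, label) examples the model labels correctly."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)

def promote_if_better(current, candidate, static_test_set, min_gain=0.0):
    """Return the candidate only if it beats the current model on the fixed set."""
    current_score = accuracy(current, static_test_set)
    candidate_score = accuracy(candidate, static_test_set)
    return candidate if candidate_score > current_score + min_gain else current

# Fixed, annotated test set that is never regenerated by the system itself.
static_test_set = [
    ("win money now", "spam"),
    ("lunch at noon?", "ham"),
    ("free money", "spam"),
    ("meeting notes", "ham"),
]

def current_model(text):
    return "spam" if "free" in text else "ham"

def retrained_model(text):
    return "spam" if "money" in text else "ham"

def drifted_model(text):
    # A model that learned from its own outputs and now flags everything.
    return "spam"

print(promote_if_better(current_model, retrained_model, static_test_set).__name__)
print(promote_if_better(current_model, drifted_model, static_test_set).__name__)
```

Because the test set stays static, a model that has drifted by feeding on its own predictions fails the gate even if it looks good on freshly generated data.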


Examples:

  • Naive spam filters
  • Common speech to text models (language model)

Level 3: Autonomous (or self-correcting)

Dynamic training data, dynamic test data

Systems that both alter human behavior (e.g. recommend an action and let the user opt in) and learn directly from that behavior, including how the system’s choices change the user’s behavior. Moving from Level 2 to 3 potentially represents a big increase in system reliability and total achievable performance.

Level 3 is great because it can consistently get better over time. However, it is more complex: it might require truly staggering amounts of data, or a very carefully designed setup, to do better than simpler systems; its ability to adapt to the environment also makes it very hard to debug. It is also possible to have truly catastrophic feedback loops. For example, a human corrects an email spam filter – however, because the human can only ever correct misclassifications that the system made, it learns that all its predictions are wrong and inverts its own predictions.

Level 4: Intelligent (or globally optimizing)

Dynamic training data, dynamic test data, dynamic goal

Systems that both dynamically interact with an environment and globally optimize (e.g. towards some set of downstream objectives), such as facilitating an agent while optimizing for AHT and CSAT, or optimizing directly for profit. For example, an AutoCompose system that optimizes for the best series of clicks to optimize the conversation.

Level 4 can be very attractive. However, it is not always obvious how to get there, and unless carefully designed, these systems can optimize towards degenerate solutions. Aiming them at the right problem, shaping the reward, and auditing their behavior are large and non-trivial tasks.

Why consider levels?

Designing and building AI systems is difficult. A core part of that difficulty is understanding how they change over time (or don’t change!): how the performance, and maintenance cost, of the system will develop.

In general, there is increasing value as you move up levels, e.g. one goal might be to move a system operating at Level 1 to be at Level 2 – but complexity (and cost) of system build also increases as levels go up. It can make a lot of sense to start with a novel feature at a “low” level, where the system behavior is well understood, and progressively increase the level – as understanding the failure cases of the system becomes more difficult as the level increases.

The focus should be on learning about the problem and the solution space. Lower levels are more consistent and can be much better avenues to explore possible solutions than higher levels, whose cost and variability in performance can be large hindrances.

This set of levels provides some core breakpoints for how different AI systems can behave. Employing these levels, and making trade-offs between them, can provide a shorthand for how systems will differ after deployment.

AI Research

Wav2vec could be more efficient, so we created our own pre-trained ASR Model for better Conversational AI.

Felix Wu
Feb 3
2 mins

In recent years, research efforts in natural language processing and computer vision have worked to improve the efficiency of pre-trained models to avoid the financial and environmental costs associated with training and fine-tuning them. For whatever reason, we have not seen such efforts in speech. In addition to saving costs associated with more efficient training of pre-trained models, for speech, efficiency gains could also mean greater performance for similar inference times.

Today, Wav2vec 2.0 (W2V2) is arguably the most popular approach for using self-supervised training in speech. It has received a lot of attention, and follow-up work has applied pre-trained W2V2 models to various downstream applications including speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021). Yet, we hypothesize that there are many sub-optimal design choices in the model architecture that make it relatively inefficient. To test this hypothesis, we conducted a series of experiments on different components of the W2V2 model architecture and exposed the performance-efficiency tradeoff of the W2V2 model design space. Higher performance (lower word error rate in ASR) requires a larger pre-trained model and comes with lower efficiency (slower inference). Can we achieve a better tradeoff (similar performance with higher inference speed)?

What do we propose instead? A more efficient pre-trained model that also achieves better performance through its efficiency gains.

Squeezed and Efficient Wav2vec (SEW)

Based on our observations, we propose SEW (Squeezed and Efficient Wav2vec) and SEW-D (SEW with Disentangled attention), which achieve a much better performance-efficiency tradeoff. With a 1.9x speedup during inference, our smaller SEW-D-mid achieves 13.5% WERR (word error rate reduction) compared to W2V2-base on academic datasets. Our larger SEW-D-base+ model performs close to W2V2-large while operating at the same speed as W2V2-base. It needs only 1/4 of the training epochs to outperform W2V2-base, which significantly reduces the pre-training cost.
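For readers less familiar with the metric, here is a minimal sketch of how a WERR figure is computed, assuming WERR means the usual relative reduction in word error rate against a baseline. The WER values below are hypothetical and chosen only to reproduce a 13.5% reduction; they are not results from the paper.

```python
def word_error_rate_reduction(baseline_wer, new_wer):
    """Relative WER reduction of a new model against a baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent, for illustration only.
baseline_wer = 10.0
new_wer = 8.65

werr = word_error_rate_reduction(baseline_wer, new_wer)
print(f"WERR: {werr:.1%}")
```

Because WERR is relative, the same 13.5% corresponds to different absolute WER improvements depending on how strong the baseline already is.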

SEW differs from conventional W2V2 models through three major modifications.

  1. First, we introduce a compact waveform feature extractor which allocates the computation across layers more evenly. This makes the model faster without sacrificing performance.
  2. Second, we propose a “squeeze context network” which downsamples the audio sequence and reduces the computation and memory usage. This allows us to use a larger model without sacrificing inference speed.
  3. Third, we introduce MLP predictor heads during pre-training which improve the performance without any overhead in the downstream application since they will be discarded after pre-training.
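To give a feel for why squeezing the sequence saves computation, here is a toy sketch of the downsample-then-upsample idea. It is not the SEW implementation: the average pooling, the repeat-based upsampling, the factor of 2, and the scalar "frames" are all assumptions made for illustration, and the real model works on learned feature vectors with its own layers.

```python
def downsample(frames, factor=2):
    """Average-pool consecutive frames (assumed pooling scheme, for illustration)."""
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        pooled.append(sum(window) / len(window))
    return pooled


def upsample(frames, factor=2, length=None):
    """Repeat each frame `factor` times, optionally trimming to the original length."""
    repeated = [value for value in frames for _ in range(factor)]
    return repeated if length is None else repeated[:length]


def attention_pair_count(seq_len):
    """Number of query-key pairs scored by standard full self-attention."""
    return seq_len * seq_len


# Toy scalar frames standing in for per-frame feature vectors.
frames = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

squeezed = downsample(frames)                       # half as many frames
restored = upsample(squeezed, length=len(frames))   # back to the original frame rate

print(len(frames), len(squeezed), len(restored))
print(attention_pair_count(len(frames)), attention_pair_count(len(squeezed)))
```

Since full self-attention scores every pair of positions, halving the sequence length cuts the number of pairs to a quarter. That kind of saving is what leaves room for a larger model at similar inference speed.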

SEW-D further replaces the normal self-attention with disentangled self-attention proposed in DeBERTa (He et al., 2020) which achieves better performance with half of the number of parameters and a significant reduction in both inference time and memory footprint.

Why it matters

These pre-trained models open the door for cost savings and/or performance gains for a number of downstream models in automatic speech recognition, speaker identification, intent classification, emotion recognition, sentiment analysis and named entity recognition. The speedup of a pre-trained model transfers directly to the downstream models: because the pre-trained model is smaller and faster, the fine-tuned downstream model is also smaller and faster. These efficiency gains reduce not only training and fine-tuning time but also the latency actually observed in products. Conversational AI systems using the SEW pre-trained models will be able to better detect what consumers are saying, who’s saying what, and how they feel, and to respond faster.

“The SEW speech models by ASAPP are faster and require less memory, without sacrificing recognition quality,” explains Anton Lozhkov, Machine Learning Engineer at Hugging Face. “The architecture improvements proposed by the team are very easy to apply to other existing Wav2Vec-based models – essentially granting performance gains for free in applications such as automatic speech recognition, speaker identification, intent classification, and emotion recognition.”

Want to use the pre-trained models from ASAPP? See our paper and open source code for more details. Our pre-trained models are also now available in Hugging Face’s transformers library and model hub. Our paper has been accepted and will appear at ICASSP 2022. Please feel free to reach out to the authors in the post-session during the conference.
