Automatic Speech Recognition (ASR), as its name indicates, is a technology tasked with deriving text transcriptions from auditory speech. As such, ASR is the backbone that provides real-time transcriptions for downstream tasks. This includes critical machine learning (ML) and natural language processing (NLP) tools that help human agents reach optimal performance. Downstream ML/NLP examples include auto-suggest features for an agent based on what a customer is saying during a call, creating post-call summary notes from what was said, or intent classification, i.e. knowing what a customer is calling for to pair them with the most appropriate agent. Crucial to the success of these AI systems is the accuracy of speech transcriptions. Only by accurately detecting what a customer or agent is saying in real-time, can we have AI systems provide insights or automate tasks accordingly.
A key way to improve this accuracy is to provide more surrounding speech information to the ASR model. Rather than having an ASR model predict what a speaker is saying only based on what’s said before, by also using what’ll be said as future context, is a model able to better predict and detect the difference between someone who said: “I’m going to the cinema today [to watch the new James Bond]” versus “I’m going to the cinema to date… [James Bond].” When we predict words, using speech frames from future utterances gives more context. And by utilizing more context, some of the errors which emerge from the limitation of the method relying on past context only can be fixed.
A downside to the increased accuracy of the longer contextual approach with future speech frames comes with a trade-off in latency and speed for waiting and computing future frames. Latency constraints vary depending upon services and applications. People usually train the best model at a given latency requirement. You would compromise the accuracy of an ASR model if it were used in a different latency condition from the one incurred to model training. To meet various scenarios or service requirements with this approach thus means that several different models would have to be trained separately – making development and maintenance difficult, which is a scalability issue.
At ASAPP, we require the highest accuracy and lowest latency to achieve true real-time insights and automation. However, given our diverse product offerings with different latency requirements, we also try to address the scalability issue efficiently. So to overcome this challenge, research accepted at Interspeech 2021 takes a new approach with an ASR model that dynamically adjusts its latency based on different constraints without the accuracy compromise, which we refer to as Multi-mode ASR.
The ASAPP Research: A Multi-mode Transformer Transducer with Stochastic Future Context
Our work expands upon previous research on dual-mode ASR (Yu et al., 2020). A Transformer model has the same structure for both the full context model and the streaming model: the full context model uses unlimited future context and the streaming model uses limited future context (e.g., 0 or 1 future speech frame per each neural layer, where a frame requires 10ms speech and we use 12 layers). The only difference is that self-attention controls how many future frames the model would access by masking the frames. Therefore, it is possible to operate the same model in full context and streaming mode. Additionally, we can use “knowledge distillation” when training the streaming mode. That is, we train the streaming mode not only on its original objective, but also to have outputs that are similar to the ones produced by the full context mode. This way, we can further bridge the gap between streaming and full context modes. This method remarkably improves the problem of accuracy drop and alignment delay of streaming ASR. We were directly motivated by this method and have been studying to extend it to multiple modes.
Our multi-mode ASR is similar to dual-mode but it is broader and more general. We didn’t limit the streaming mode to a single configuration using only one future context size, but defined it as using a stochastic future context size instead. As described in Figure 1 below, dual-mode ASR is trained on a predefined pair consisting of the full context mode and the zero context (streaming) mode. In contrast, multi-mode ASR trains a model using multiple pairs of the full context mode and the streaming mode with a future context size of C where C is sampled from a stochastic distribution.
Since C is selected from a distribution for every single minibatch during training, a single model is trained on various future context conditions.
We say that evaluation conditions are matched when the training context size and the inference context size are the same, and that they are mismatched otherwise. The results in Table 1 show that a streaming model only works well when it’s matched, i.e., trained and evaluated on past speech alone. . Although the results for the dual-mode trained model are better than the trained-alone model – a result of the knowledge distillation, it also doesn’t work well in the mismatched condition. Contrary to this, it can be confirmed that our proposed multi-mode trained model operates reliably in multiple conditions, because the mismatched condition is eliminated by using a stochastic future context. Looking at the detailed results for each context condition, it can be expected that training for this stochastic future context also can bring regularization effects to a single model.
Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.
Why this matters
ASR is used in services with various environments and scenarios. To create downstream ML and NLP tasks that produce results within seconds and work well with human workflows, ASAPP’s ASR model must similarly operate in milliseconds based on the situation. Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.
By exposing a single model to various conditions, one model can have the ability to change the amount of used future context needed to meet the latency requirements for a particular application. This makes it easier and more resource-efficient to cover all different scenarios. Thinking further, if the latency is increased due to unpredictable load in service, it is possible to change the configuration easily on the fly, and it is viable to significantly increase the usability with minimal accuracy degradation. Algorithms for responding to multiple scenarios usually suffer sub-optimal performance problems compared to a model optimized for one condition. But multi-mode ASR shows the possibility that it can easily cover multiple conditions without such problems.
What’s next for us at ASAPP
The paper about this study will be presented at Interspeech 2021 (Wed, Sep 1st, 11:00 ~ 13:00, GMT +2). The method and detailed results are described in that paper. We believe that this research topic is one of the promising directions to effectively support various applications, services, and customers. Research is also underway to extend this method to train a general model by combining it with pre-training methods. We will continue to focus on research on scalability as an important factor in terms of model training and deployment.
Kwangyoun Kim is a Senior Speech Scientist at ASAPP. His research focuses on end-to-end speech recognition technology and related algorithms, especially in model training methods. He received his B.S. and M.S. degrees in the electrical engineering from Korea University, Seoul, Korea.