Kwangyoun Kim is a Senior Speech Scientist at ASAPP. His research focuses on end-to-end speech recognition technology and related algorithms, especially model training methods. He received his B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Korea.
Automatic Speech Recognition (ASR), as its name indicates, is a technology that derives text transcriptions from audible speech. As such, ASR is the backbone that provides real-time transcriptions for downstream tasks. These include critical machine learning (ML) and natural language processing (NLP) tools that help human agents reach optimal performance. Downstream ML/NLP examples include auto-suggest features for an agent based on what a customer is saying during a call, post-call summary notes generated from what was said, and intent classification, i.e., knowing what a customer is calling about in order to pair them with the most appropriate agent. Crucial to the success of these AI systems is the accuracy of speech transcriptions. Only by accurately detecting what a customer or agent is saying in real time can AI systems provide insights or automate tasks accordingly.
A key way to improve this accuracy is to provide more surrounding speech information to the ASR model. Rather than predicting what a speaker is saying based only on what was said before, a model that also uses what will be said next as future context is better able to tell the difference between someone who said “I’m going to the cinema today [to watch the new James Bond]” and “I’m going to the cinema to date… [James Bond].” When we predict words, using speech frames from future utterances gives more context. With that additional context, some of the errors that stem from relying on past context alone can be fixed.
The downside is that the increased accuracy of this longer-context approach comes with a trade-off in latency and speed, because the model must wait for and compute future frames. Latency constraints vary across services and applications, and people usually train the best model they can for a given latency requirement. An ASR model loses accuracy if it is used under a latency condition different from the one it was trained for. Meeting various scenarios or service requirements with this approach therefore means training several different models separately, which makes development and maintenance difficult and creates a scalability issue.
At ASAPP, we require the highest accuracy and lowest latency to achieve true real-time insights and automation. However, given our diverse product offerings with different latency requirements, we also need to address the scalability issue efficiently. To overcome this challenge, our research accepted at Interspeech 2021 takes a new approach: an ASR model that dynamically adjusts its latency to different constraints without compromising accuracy, which we refer to as Multi-mode ASR.
The ASAPP Research: A Multi-mode Transformer Transducer with Stochastic Future Context
Our work expands upon previous research on dual-mode ASR (Yu et al., 2020). A Transformer model has the same structure for both the full context model and the streaming model: the full context model uses unlimited future context and the streaming model uses limited future context (e.g., 0 or 1 future speech frame per neural layer, where a frame corresponds to 10ms of speech and we use 12 layers). The only difference is how many future frames self-attention is allowed to access, which is controlled by masking the frames. Therefore, it is possible to operate the same model in both full context and streaming mode. Additionally, we can use “knowledge distillation” when training the streaming mode. That is, we train the streaming mode not only on its original objective, but also to produce outputs similar to those of the full context mode. This way, we can further bridge the gap between streaming and full context modes. This method substantially reduces the accuracy drop and alignment delay of streaming ASR. We were directly motivated by this method and have been working to extend it to multiple modes.
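To make the masking idea concrete, here is a minimal sketch in plain Python. It is an illustration rather than our training code: the function names are hypothetical, it assumes every past frame stays visible and only the future is limited, and the lookahead estimate simply multiplies the per-layer future context by the 12 layers and 10ms frames used in the example above.

```python
from typing import List, Optional


def future_context_mask(num_frames: int, future_context: Optional[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    future_context=None means full context mode (nothing is masked).
    Assumption: all past frames are visible; only future frames are limited.
    """
    mask = []
    for i in range(num_frames):
        if future_context is None:
            row = [True] * num_frames
        else:
            # Frame j is visible if it is at most `future_context` frames ahead of i.
            row = [j <= i + future_context for j in range(num_frames)]
        mask.append(row)
    return mask


def total_lookahead_ms(future_context: int, num_layers: int = 12, frame_ms: int = 10) -> int:
    """Lookahead accumulated over stacked layers: each layer adds `future_context` frames."""
    return future_context * num_layers * frame_ms


if __name__ == "__main__":
    for row in future_context_mask(4, 1):
        print(["x" if visible else "." for visible in row])
    print(total_lookahead_ms(1), "ms of lookahead with 1 future frame per layer")
```

Switching between modes then only means passing a different mask to the same model; no weights change.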
Our multi-mode ASR is similar to dual-mode but it is broader and more general. We didn’t limit the streaming mode to a single configuration using only one future context size, but defined it as using a stochastic future context size instead. As described in Figure 1 below, dual-mode ASR is trained on a predefined pair consisting of the full context mode and the zero context (streaming) mode. In contrast, multi-mode ASR trains a model using multiple pairs of the full context mode and the streaming mode with a future context size of C where C is sampled from a stochastic distribution.
Since C is selected from a distribution for every single minibatch during training, a single model is trained on various future context conditions.
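The per-minibatch sampling can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate context sizes in `CONTEXT_CHOICES` and the uniform distribution are hypothetical choices made for the example, and `sample_mode_pairs` is not a function from our codebase.

```python
import random
from typing import Iterator, List, Optional, Tuple

# Hypothetical candidate sizes; the text above only says C is drawn from a distribution.
CONTEXT_CHOICES: List[int] = [0, 1, 2, 4]


def sample_mode_pairs(num_minibatches: int, seed: int = 0) -> Iterator[Tuple[Optional[int], int]]:
    """Yield one (full_context, streaming_context) pair per minibatch.

    None stands for the full context mode; the int is the sampled future context size C.
    Assumption: C is drawn uniformly from CONTEXT_CHOICES.
    """
    rng = random.Random(seed)
    for _ in range(num_minibatches):
        yield None, rng.choice(CONTEXT_CHOICES)


if __name__ == "__main__":
    for step, (full_mode, c) in enumerate(sample_mode_pairs(5)):
        # A real training step would run the model once per mode on the same
        # minibatch, using a different attention mask for each mode.
        print(f"minibatch {step}: full context + streaming with C={c}")
```

Because a fresh C is drawn for each minibatch, the same set of weights sees many future context sizes over the course of training.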
We say that evaluation conditions are matched when the training context size and the inference context size are the same, and that they are mismatched otherwise. The results in Table 1 show that a streaming model only works well when it’s matched, i.e., trained and evaluated on past speech alone. Although the dual-mode trained model performs better than the model trained alone, a result of the knowledge distillation, it also doesn’t work well in the mismatched condition. In contrast, our proposed multi-mode trained model operates reliably in multiple conditions, because training with a stochastic future context eliminates the mismatched condition. The detailed results for each context condition also suggest that training with a stochastic future context can have a regularization effect on a single model.
Why this matters
ASR is used in services with various environments and scenarios. To create downstream ML and NLP tasks that produce results within seconds and work well with human workflows, ASAPP’s ASR model must similarly operate in milliseconds based on the situation. Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.
By exposing a single model to various conditions during training, that model gains the ability to change the amount of future context it uses to meet the latency requirements of a particular application. This makes it easier and more resource-efficient to cover all the different scenarios. Furthermore, if latency increases due to unpredictable load in a service, the configuration can easily be changed on the fly, which significantly increases usability with minimal accuracy degradation. Algorithms designed to handle multiple scenarios usually perform sub-optimally compared to a model optimized for one condition, but multi-mode ASR shows that it is possible to cover multiple conditions without such problems.
What’s next for us at ASAPP
The paper about this study will be presented at Interspeech 2021 (Wed, Sep 1st, 11:00 ~ 13:00, GMT +2). The method and detailed results are described in that paper. We believe that this research topic is one of the promising directions to effectively support various applications, services, and customers. Research is also underway to extend this method to train a general model by combining it with pre-training methods. We will continue to focus on research on scalability as an important factor in terms of model training and deployment.
ASR technology has been beneficial for businesses and their customers for many years. ASR, or Automatic Speech Recognition, is the software that translates human speech into text. With continual advancements in research and AI modeling, accuracy has improved immensely over time. Developing the most accurate ASR possible has become a high priority for many top tech companies because of how much it benefits businesses when it’s done correctly.
ASR’s primary goal is to maintain high recognition accuracy. Recognition rates or error rates can be evaluated over various units, such as phonemes, characters, words, or sentences. The most commonly used measure of ASR accuracy is Word Error Rate (WER).
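For readers new to the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of words in the reference. The sketch below is a minimal plain-Python illustration; the function name is ours, and it splits on whitespace without the text normalization that evaluation toolkits usually apply.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "i am going to the cinema today"
    hypothesis = "i am going to the cinema to date"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In this example, "today" recognized as "to date" costs one substitution and one insertion against a seven-word reference, so the WER is 2/7.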
To fairly compare AI speech recognition research studies across the industry, we evaluate WER on public datasets. Librispeech, one of the most widely used datasets, consists of about 1000 hours of read English speech with transcriptions and an additional text corpus. Researchers worldwide have been competing for years to substantiate their methods’ superiority using the Librispeech dataset and WER.
Recently, the speech community has been trending towards end-to-end (E2E) modeling for ASR. Instead of having separate acoustic and language models, as in conventional ASR methods, E2E modeling has achieved great success in both efficiency and accuracy by simultaneously training a single integrated model.
Although several E2E model structures, such as Transducer and Attention-based Encoder-Decoder (AED), have been explored, most of them share a common component: the encoder, the module that extracts meaningful representative information from the input speech.
Speech scientists, looking to create a more powerful encoder, are actively studying novel training objectives, acoustic feature engineering, data augmentation methods, and self-supervised learning using untranscribed speech.
But these research areas don’t address a fundamental question, “What is the optimal neural network architecture for constructing the encoder?”
To address this question, ASAPP researchers recently developed the E-Branchformer model, a highly accurate neural network. Other similar models include Transformer, Conformer, and Branchformer; however, the E-Branchformer surpasses these models in accuracy. Here’s a quick overview of the different models ASAPP used to develop E-Branchformer.
The Transformer has shown promising performance in several sequence modeling tasks for ASR and NLU (natural language understanding). This potential is due to the strength of multi-headed self-attention, which can effectively extract meaningful information from the input sequence, while considering the global context.
To improve the Transformer, many methods have been introduced that create synergy by adding convolution, which has advantages in modeling local context. In particular, Conformer was introduced and is widely considered to deliver state-of-the-art accuracy in Librispeech ASR tasks.
When evaluated with an external Language Model (LM) trained on the Librispeech text corpus, Conformer achieves 1.9% and 3.9% WER on test-clean and test-other, respectively. Although Conformer demonstrates that stacking convolution sequentially after self-attention is a better method than using them in parallel, other research studies, like Branchformer, have applied these two neural networks in parallel and found the performance to be competitive.
Branchformer was introduced with three main components:
Local-context branch using MLP with convolutional gating (cgMLP)
Global-context branch using multi-headed self-attention
Merging module with a linear projection layer
Each branch is computed in parallel before results are merged. Through intensive experiments, Branchformer showed comparable performance with Conformer. Other experiments stacked different combinations by mixing Branchformer and Conformer, but didn’t achieve better results.
Inspired by the Branchformer studies, ASAPP researched how convolution and self-attention can be combined more effectively and efficiently.
This resulted in the highest performing model, E-Branchformer, setting the new state-of-the-art WER at 1.81% and 3.65% on Librispeech test-clean and test-other with an external LM.
To develop E-Branchformer, we made two primary contributions to Branchformer that significantly improved performance.
We enhanced the merging module, which combines the output of the global and local branches, by introducing additional convolutions. This change has the effect of combining self-attention with convolution both sequentially and in parallel. Through extensive experiments on several types of merge modules, we showed that adding a single depth-wise convolution can significantly improve accuracy with a negligible increase in computation.
We revisited the point-wise Feed-Forward Network (FFN). Transformer and its variants commonly stack FFN with self-attention in an interleaving pattern. We experimentally demonstrated that even in Branchformer, stacking FFNs together with the Branchformer blocks is more effective in improving the model’s capacity. For example, we found that a stack of 17 Branchformers and 17 FFNs in an interleaving pattern has a similar model size to a stack of 25 Branchformers, but is much more accurate.
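As a rough illustration of the first contribution, the sketch below merges two branch outputs in plain Python: it concatenates them along the channel axis, applies a depth-wise convolution over time, and then applies a linear projection. Treat it as a sketch under assumptions rather than the actual E-Branchformer implementation: adding the convolution output back to the concatenated features, the zero padding, and all function names are choices made for this example, and the inputs stand in for the outputs of the self-attention and cgMLP branches.

```python
from typing import List

Matrix = List[List[float]]  # shape: [time][channels]


def concat_branches(global_out: Matrix, local_out: Matrix) -> Matrix:
    """Concatenate the two branch outputs along the channel axis."""
    return [g + l for g, l in zip(global_out, local_out)]


def depthwise_conv(x: Matrix, kernel: List[List[float]]) -> Matrix:
    """Depth-wise 1-D convolution over time: one odd-length filter per channel, zero padded."""
    num_frames, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        taps = kernel[c]
        half = len(taps) // 2
        for t in range(num_frames):
            acc = 0.0
            for k, w in enumerate(taps):
                src = t + k - half
                if 0 <= src < num_frames:  # frames outside the sequence count as zero
                    acc += w * x[src][c]
            out[t][c] = acc
    return out


def linear(x: Matrix, weight: Matrix) -> Matrix:
    """Project each frame; weight has shape [in_channels][out_channels]."""
    columns = list(zip(*weight))
    return [[sum(xi * wi for xi, wi in zip(frame, col)) for col in columns] for frame in x]


def merge(global_out: Matrix, local_out: Matrix, kernel: List[List[float]], weight: Matrix) -> Matrix:
    """Concatenate, enhance with a depth-wise convolution, then project."""
    merged = concat_branches(global_out, local_out)
    conv = depthwise_conv(merged, kernel)
    # Assumption: the convolution output is added back to the concatenated features.
    enhanced = [[m + c for m, c in zip(mf, cf)] for mf, cf in zip(merged, conv)]
    return linear(enhanced, weight)


if __name__ == "__main__":
    global_out = [[1.0], [2.0], [3.0]]     # stand-in for the self-attention branch
    local_out = [[10.0], [20.0], [30.0]]   # stand-in for the cgMLP branch
    kernel = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    weight = [[1.0], [1.0]]
    print(merge(global_out, local_out, kernel, weight))
```

With all kernel taps set to zero, the function reduces to plain concatenation followed by a linear projection, which is the simpler merging step described for Branchformer above.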
ASAPP has topped the leaderboard of WER in Librispeech ASR tasks by using the newly proposed E-Branchformer. We are confident that this new model structure can be applied to other tasks and achieve impressive results.
We’re sharing our findings with the community so that everyone can benefit from them. You’ll be able to find all of the detailed methods and experimental results in our upcoming paper, which has been accepted and will be presented at SLT 2022. We’ll also release more information about how we implemented E-Branchformer. Our models’ recipes will be available through ESPnet, so anyone who wants to can reproduce our results. If you’d like to talk about E-Branchformer in person, please reach out to us during the SLT 2022 conference.