ASAPP tops ASR leaderboard with E‑Branchformer
ASR, or Automatic Speech Recognition, is the software that transcribes human speech into text, and it has benefited businesses and their customers for many years. With continual advances in research and AI modeling, accuracy has improved immensely over time. Developing the most accurate ASR possible has become a high priority for many top tech companies, because accurate recognition delivers real value to the businesses that use it.
The primary goal of ASR is high recognition accuracy. Accuracy, or equivalently error rate, can be measured at several levels of granularity: phonemes, characters, words, or sentences. The most commonly used metric is Word Error Rate (WER), the fraction of words a system gets wrong relative to a reference transcript.
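Concretely, WER counts the word-level substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. A minimal sketch (the `wer` helper below is our illustration, not a standard API):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Note that WER can exceed 100% when the hypothesis contains many insertions; production toolkits also report the aligned breakdown of error types.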
To compare speech recognition research fairly across the industry, we evaluate WER on public datasets. Librispeech, one of the most widely used, consists of about 1,000 hours of read English speech with transcriptions, plus an additional text corpus. Researchers worldwide have competed for years to demonstrate their methods' superiority using WER on Librispeech.
Recently, the speech community has been trending toward end-to-end (E2E) modeling for ASR. Instead of training separate acoustic and language models, as in conventional ASR systems, E2E modeling trains a single integrated model, and has achieved great success in both efficiency and accuracy.
Although several E2E model structures, such as the Transducer and the Attention-based Encoder-Decoder (AED), have been explored, most of them share a common component: the encoder, the module that extracts meaningful representations from the input speech.
Speech scientists, looking to create a more powerful encoder, are actively studying novel training objectives, acoustic feature engineering, data augmentation methods, and self-supervised learning using untranscribed speech.
But these research areas don't address a more fundamental question: what is the optimal neural network architecture for the encoder itself?
To address this question, ASAPP researchers recently developed E-Branchformer, a highly accurate neural network architecture. Related encoder architectures include the Transformer, Conformer, and Branchformer; E-Branchformer surpasses them all in accuracy. Here's a quick overview of the models that led to E-Branchformer.
The Transformer has shown promising performance in several sequence modeling tasks, including ASR and NLU (natural language understanding). This strength comes from multi-headed self-attention, which effectively extracts meaningful information from the input sequence while attending to the global context.
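As a concrete illustration of that mechanism, here is a minimal NumPy sketch of multi-headed self-attention. The dimensions and random weights are arbitrary and only demonstrate the shapes; real systems use trained parameters, masking, and optimized kernels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (seq_len, d_model). Every head attends over the whole
    sequence, so each output frame can draw on global context."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    ctx = softmax(scores) @ v                            # (heads, seq, d_head)
    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)
    return ctx @ wo                                      # output projection

rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((5, d_model))                    # 5 speech frames
weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_self_attention(x, *weights, num_heads=2)
print(y.shape)  # (5, 8)
```

Each attention score matrix spans the full sequence, which is exactly what gives the encoder its global receptive field.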
To improve the Transformer, many methods add convolution, which has complementary advantages in modeling the local context, and combine the two for synergy.
In particular, the Conformer was introduced and is widely regarded as the state of the art for accuracy on Librispeech ASR tasks.
When decoding with an external Language Model (LM) trained on the Librispeech text corpus, Conformer achieves 1.9% and 3.9% WER on test-clean and test-other, respectively. Although Conformer demonstrated that stacking convolution sequentially after self-attention works better than using the two in parallel, other studies, like Branchformer, have applied the two modules in parallel and found the performance competitive.
Branchformer was introduced with three main components:
- Local-context branch using MLP with convolutional gating (cgMLP)
- Global-context branch using multi-headed self-attention
- A merge module with a linear projection layer
The two branches are computed in parallel, and their results are then merged. Through extensive experiments, Branchformer showed performance comparable to Conformer. Other experiments stacked different combinations of Branchformer and Conformer layers, but didn't achieve better results.
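Putting the three components together, a Branchformer layer can be sketched roughly as follows. This is our simplified illustration, not the paper's exact implementation: single-head attention stands in for multi-head, ReLU for GELU, and normalization and dropout are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8

def self_attention(x):
    # Global-context branch (simplified to a single head).
    scores = x @ x.T / np.sqrt(x.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

def cgmlp(x, w_up, w_dw, w_down):
    # Local-context branch: MLP with convolutional gating.
    # Project up, split channels, gate one half with a depth-wise conv of the other.
    u = np.maximum(x @ w_up, 0)                  # (seq, 2d); the paper uses GELU
    a, b = np.split(u, 2, axis=1)
    pad = np.pad(b, ((1, 1), (0, 0)))            # kernel size 3, 'same' padding
    conv = sum(pad[k:k + len(b)] * w_dw[k] for k in range(3))  # depth-wise conv
    return (a * conv) @ w_down

def branchformer_block(x, params):
    g = self_attention(x)                        # global context, in parallel with...
    l = cgmlp(x, *params["cgmlp"])               # ...local context
    merged = np.concatenate([g, l], axis=1)      # merge: concat + linear projection
    return x + merged @ params["merge"]          # residual connection

params = {
    "cgmlp": (rng.standard_normal((d, 2 * d)) * 0.1,
              rng.standard_normal((3, d)) * 0.1,
              rng.standard_normal((d, d)) * 0.1),
    "merge": rng.standard_normal((2 * d, d)) * 0.1,
}
x = rng.standard_normal((seq_len, d))
out = branchformer_block(x, params)
print(out.shape)  # (6, 8)
```

The key structural idea is visible in `branchformer_block`: both branches read the same input, and only the merge step brings their outputs together.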
Inspired by the Branchformer studies, ASAPP researched how convolution and self-attention can be combined more effectively and efficiently.
This resulted in our highest-performing model, E-Branchformer, which sets a new state-of-the-art WER of 1.81% on Librispeech test-clean and 3.65% on test-other with an external LM.
To develop E-Branchformer, we made two primary contributions to Branchformer that significantly improved performance.
- We enhanced the merge module, which combines the outputs of the global and local branches, by introducing additional convolution. This change combines self-attention with convolution both sequentially and in parallel. Through extensive experiments on several types of merge modules, we showed that adding a single depth-wise convolution can significantly improve accuracy with a negligible increase in computation.
- We revisited the point-wise Feed-Forward Network (FFN). The Transformer and its variants commonly stack FFNs with self-attention in an interleaving pattern. We demonstrated experimentally that in Branchformer, too, interleaving FFNs is a more effective use of model capacity. For example, we found that a stack of 17 Branchformer layers interleaved with 17 FFNs has a similar model size to a stack of 25 Branchformer layers, but is considerably more accurate.
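The first enhancement, the merge module with an added depth-wise convolution, can be sketched like this. Again, this is an illustration under our own simplifications (kernel size, placement of the residual), not the exact configuration from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 8

def depthwise_conv1d(x, w):
    """Depth-wise 1-D convolution: one kernel-size-3 filter per channel,
    'same' padding along the time axis."""
    pad = np.pad(x, ((1, 1), (0, 0)))
    return sum(pad[k:k + len(x)] * w[k] for k in range(len(w)))

def enhanced_merge(global_out, local_out, w_dw, w_proj):
    """Enhanced merge: concatenate the two branches, refine the result with
    a depth-wise conv, then project back to the model dimension. The conv
    lets the merged features also mix along time, at very small cost."""
    cat = np.concatenate([global_out, local_out], axis=1)  # (seq, 2d)
    cat = cat + depthwise_conv1d(cat, w_dw)                # residual depth-wise conv
    return cat @ w_proj                                    # (seq, d)

g = rng.standard_normal((seq_len, d))       # global branch output
l = rng.standard_normal((seq_len, d))       # local branch output
w_dw = rng.standard_normal((3, 2 * d)) * 0.1
w_proj = rng.standard_normal((2 * d, d)) * 0.1
y = enhanced_merge(g, l, w_dw, w_proj)
print(y.shape)  # (6, 8)
```

The depth-wise conv adds only kernel_size × channels parameters (here 3 × 16), which is why the accuracy gain comes at negligible computational cost.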
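The size comparison in the second point can be sanity-checked with rough bookkeeping: if a Branchformer layer has P_b parameters and an FFN has P_f, the two stacks match when 17(P_b + P_f) ≈ 25 P_b, i.e. P_f ≈ 8/17 of P_b. These are illustrative relative units of ours, not the paper's exact counts:

```python
# Illustrative relative parameter units (ours, not the paper's exact counts).
P_b = 1.0       # parameters in one Branchformer layer (relative)
P_f = 8 / 17    # parameters in one FFN (relative), about 0.47 of a layer

interleaved = 17 * (P_b + P_f)   # 17 Branchformer layers + 17 FFNs
plain = 25 * P_b                 # 25 Branchformer layers
print(interleaved, plain)        # both 25.0 in these units
```

So the interleaved stack spends a comparable parameter budget, but allocates part of it to FFNs, which the experiments showed to be the better use of capacity.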
ASAPP has topped the leaderboard of WER in Librispeech ASR tasks by using the newly proposed E-Branchformer. We are confident that this new model structure can be applied to other tasks and achieve impressive results.
We're sharing our findings with the community so that everyone can benefit from them. You'll find all of the detailed methods and experimental results in our upcoming paper, which has been accepted for presentation at SLT 2022. We'll also release more information about how we implemented E-Branchformer, and our model recipes will be available through ESPnet so anyone can reproduce our results. If you'd like to talk about E-Branchformer in person, please reach out to us during the SLT 2022 conference.