Wav2vec could be more efficient, so we created our own pre-trained ASR model for better conversational AI.
In recent years, research efforts in natural language processing and computer vision have worked to improve the efficiency of pre-trained models, to avoid the financial and environmental costs of training and fine-tuning them. Speech, however, has seen few comparable efforts. Beyond cheaper training of pre-trained models, efficiency gains in speech could also translate into greater performance at similar inference times.
Today, Wav2vec 2.0 (W2V2) is arguably the most popular approach to self-supervised training in speech. It has received a lot of attention and follow-up work applying pre-trained W2V2 models to various downstream applications, including speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021). Yet we hypothesize that many sub-optimal design choices in the model architecture make it relatively inefficient. To test this hypothesis, we conducted a series of experiments on different components of the W2V2 architecture and exposed the performance-efficiency tradeoff of the W2V2 design space. Higher performance (lower word error rate in ASR) requires a large pre-trained model and comes with lower efficiency (slower inference). Can we achieve a better tradeoff (similar performance with higher inference speed)?
What do we propose instead? A more efficient pre-trained model that also achieves better performance through its efficiency gains.
Squeezed and Efficient Wav2vec (SEW)
Based on our observations, we propose SEW (Squeezed and Efficient Wav2vec) and SEW-D (SEW with Disentangled attention), which achieve a much better performance-efficiency tradeoff. With a 1.9x inference speedup, our smaller SEW-D-mid achieves a 13.5% WERR (word error rate reduction) compared to W2V2-base on academic datasets. Our larger SEW-D-base+ performs close to W2V2-large while running at the speed of W2V2-base, and it takes only 1/4 of the training epochs to outperform W2V2-base, significantly reducing pre-training cost.
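To make the WERR figure above concrete: word error rate reduction is the relative drop in WER of a new model versus a baseline. A minimal sketch (the WER values below are illustrative, not numbers from the paper):

```python
def werr(wer_baseline: float, wer_new: float) -> float:
    """Relative word error rate reduction of a new model vs. a baseline."""
    return (wer_baseline - wer_new) / wer_baseline

# Illustrative numbers only: a baseline at 10.0% WER vs. a new model at 8.65% WER
print(f"{werr(10.0, 8.65):.3f}")  # → 0.135, i.e. a 13.5% WERR
```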
SEW differs from conventional W2V2 models in three major ways:
- First, we introduce a compact waveform feature extractor that allocates computation more evenly across layers, making the model faster without sacrificing performance.
- Second, we propose a “squeeze context network” that downsamples the audio sequence, reducing computation and memory usage. This allows us to use a larger model without sacrificing inference speed.
- Third, we introduce MLP predictor heads during pre-training that improve performance without any overhead in downstream applications, since they are discarded after pre-training.
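The squeeze-then-unsqueeze idea behind the second point can be sketched in plain NumPy. The pooling factor of 2 and the average-pool/duplicate choices here are illustrative assumptions, not the exact layers used in SEW:

```python
import numpy as np

def squeeze(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample a (time, features) sequence by average-pooling along time."""
    t, d = frames.shape
    t = t - t % factor                       # drop a ragged tail frame if any
    return frames[:t].reshape(t // factor, factor, d).mean(axis=1)

def unsqueeze(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Upsample back to the original frame rate by duplicating frames."""
    return np.repeat(frames, factor, axis=0)

x = np.random.randn(100, 512)                # 100 audio frames, 512 features
h = squeeze(x)                               # the transformer would run on 50 frames
y = unsqueeze(h)                             # back to 100 frames for frame-level tasks
print(h.shape, y.shape)                      # (50, 512) (100, 512)
```

Since self-attention cost is quadratic in sequence length, halving the number of frames cuts that cost by roughly 4x, which is what frees the budget for a larger model at the same inference speed.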
SEW-D further replaces standard self-attention with the disentangled self-attention proposed in DeBERTa (He et al., 2020), achieving better performance with half the number of parameters and a significant reduction in both inference time and memory footprint.
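For readers unfamiliar with disentangled attention, the core idea is that attention scores sum three terms: content-to-content, content-to-position, and position-to-content, with learned relative-position embeddings. The NumPy sketch below is a simplified single-head illustration (shared projections, no masking, no value projection, and a naive gather), not the optimized implementation used in DeBERTa or SEW-D:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                                    # toy sequence length and head dim

Hc = rng.standard_normal((T, d))               # content hidden states
P = rng.standard_normal((2 * T - 1, d))        # one embedding per relative offset
rel = np.arange(T)[:, None] - np.arange(T)[None, :] + (T - 1)  # offset -> index

Wq, Wk = rng.standard_normal((2, d, d))        # projections (shared for simplicity)
Qc, Kc = Hc @ Wq, Hc @ Wk                      # content queries / keys
Qr, Kr = P @ Wq, P @ Wk                        # relative-position queries / keys

c2c = Qc @ Kc.T                                           # content-to-content
c2p = np.take_along_axis(Qc @ Kr.T, rel, axis=1)          # content-to-position
p2c = np.take_along_axis(Kc @ Qr.T, rel, axis=1).T        # position-to-content

scores = (c2c + c2p + p2c) / np.sqrt(3 * d)    # DeBERTa scales by sqrt(3d)
e = np.exp(scores - scores.max(axis=1, keepdims=True))
A = e / e.sum(axis=1, keepdims=True)           # (T, T) attention weights
```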
Why it matters
These pre-trained models open the door to cost savings and performance gains for a number of downstream models in automatic speech recognition, speaker identification, intent classification, emotion recognition, sentiment analysis, and named entity recognition. The speedup of a pre-trained model transfers directly to the downstream models: because the pre-trained model is smaller and faster, the fine-tuned downstream model is too. These efficiency gains reduce not only training and fine-tuning time but also the latency actually observed in products. Conversational AI systems using the SEW pre-trained models will be able to better detect what consumers are saying, who is saying what, and how they feel, and to provide faster response times.
“The SEW speech models by ASAPP are faster and require less memory, without sacrificing recognition quality,” explains Anton Lozhkov, Machine Learning Engineer at Hugging Face. “The architecture improvements proposed by the team are very easy to apply to other existing Wav2Vec-based models – essentially granting performance gains for free in applications such as automatic speech recognition, speaker identification, intent classification, and emotion recognition.”
Want to use the pre-trained models from ASAPP? See our paper and open-source code for more details. Our pre-trained models are also available in Hugging Face’s transformers library and model hub. The paper has been accepted to ICASSP 2022; please feel free to reach out to the authors during the conference.