Felix Wu

Felix Wu, PhD is a Research Scientist at ASAPP. He received his Ph.D. in Computer Science from Cornell University under the supervision of Prof. Kilian Q. Weinberger and his B.S. in Computer Science and Information Engineering from National Taiwan University. His research interests include machine learning and its applications, such as natural language processing and computer vision. Recently, he has been focusing on designing efficient neural models.

Wav2vec could be more efficient, so we created our own pre-trained ASR Model for better Conversational AI.

Felix Wu

In recent years, research in natural language processing and computer vision has worked to improve the efficiency of pre-trained models, to avoid the financial and environmental costs of training and fine-tuning them. We have not seen comparable efforts in speech. Beyond saving the cost of training pre-trained models, efficiency gains in speech could also mean better performance at similar inference times.

Today, Wav2vec 2.0 (W2V2) is arguably the most popular approach to self-supervised training in speech. It has attracted a lot of attention and follow-up work applying pre-trained W2V2 models to downstream applications, including speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021). Yet we hypothesize that many sub-optimal design choices in the model architecture make it relatively inefficient. To test this hypothesis, we conducted a series of experiments on different components of the W2V2 architecture and exposed the performance-efficiency tradeoff of the W2V2 design space. Higher performance (lower word error rate in ASR) requires a larger pre-trained model and comes with lower efficiency (slower inference). Can we achieve a better tradeoff (similar performance with higher inference speed)?

What do we propose instead? A more efficient pre-trained model that also achieves better performance through its efficiency gains.

Squeezed and Efficient Wav2vec (SEW)

Based on our observations, we propose SEW (Squeezed and Efficient Wav2vec) and SEW-D (SEW with Disentangled attention), which achieve a much better performance-efficiency tradeoff. With a 1.9x inference speedup, our smaller SEW-D-mid achieves a 13.5% WERR (word error rate reduction) compared to W2V2-base on academic datasets. Our larger SEW-D-base+ model performs close to W2V2-large while running at the same speed as W2V2-base. It takes only 1/4 of the training epochs to outperform W2V2-base, which significantly reduces the pre-training cost.
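
For readers unfamiliar with the metric, WERR is the relative drop in word error rate against a baseline. A minimal sketch of the arithmetic follows; the function name and the WER values are illustrative assumptions, not results from the paper.

```python
def word_error_rate_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new model against a baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), chosen only to illustrate the arithmetic.
baseline_wer = 10.0
new_wer = 8.65
werr = word_error_rate_reduction(baseline_wer, new_wer)
print(f"WERR: {werr:.1%}")
```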

SEW differs from conventional W2V2 models in three major ways.

First, we introduce a compact waveform feature extractor which allocates the computation across layers more evenly. This makes the model faster without sacrificing performance.

Second, we propose a “squeeze context network” which downsamples the audio sequence and reduces the computation and memory usage. This allows us to use a larger model without sacrificing inference speed.

Third, we introduce MLP predictor heads during pre-training which improve the performance without any overhead in the downstream application since they will be discarded after pre-training.

SEW-D further replaces the standard self-attention with the disentangled self-attention proposed in DeBERTa (He et al., 2020), which achieves better performance with half the number of parameters and a significant reduction in both inference time and memory footprint.
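
To make the squeeze idea concrete, here is a toy sketch in plain Python. It assumes a downsampling factor of 2, and it uses average pooling to shorten the sequence and simple repetition to restore its length. These are illustrative stand-ins, not the layers used in SEW. Since full self-attention scores every pair of frames, halving the sequence length cuts the number of pairwise scores to a quarter.

```python
from typing import List

Frame = List[float]


def downsample(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Average each group of `factor` consecutive frames into one frame."""
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def upsample(frames: List[Frame], factor: int, target_len: int) -> List[Frame]:
    """Repeat each frame `factor` times and trim to the original length."""
    repeated = [list(f) for f in frames for _ in range(factor)]
    return repeated[:target_len]


def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return seq_len * seq_len


# Toy input: 8 frames with 2 features each (hypothetical values).
frames = [[float(t), float(2 * t)] for t in range(8)]
squeezed = downsample(frames, factor=2)  # the context layers would run at this length
restored = upsample(squeezed, factor=2, target_len=len(frames))

print(len(frames), len(squeezed), len(restored))
print(attention_pairs(len(frames)), attention_pairs(len(squeezed)))
```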

Why it matters

These pre-trained models open the door to cost savings and/or performance gains for a number of downstream models in automatic speech recognition, speaker identification, intent classification, emotion recognition, sentiment analysis, and named entity recognition. The speedup of a pre-trained model transfers directly to the downstream models: because the pre-trained model is smaller and faster, the fine-tuned downstream model is also smaller and faster. These efficiency gains reduce not only training and fine-tuning time but also the latency observed in products. Conversational AI systems using the SEW pre-trained models will be better able to detect what consumers are saying, who is saying what, and how they feel, and to respond faster.

“The SEW speech models by ASAPP are faster and require less memory, without sacrificing recognition quality,” explains Anton Lozhkov, Machine Learning Engineer at Hugging Face. “The architecture improvements proposed by the team are very easy to apply to other existing Wav2Vec-based models – essentially granting performance gains for free in applications such as automatic speech recognition, speaker identification, intent classification, and emotion recognition.”

Want to use the pre-trained models from ASAPP? See our paper and open source code for more details. Our pre-trained models are also available in Hugging Face’s transformers library and model hub. Our paper has been accepted and will appear at ICASSP 2022. Please feel free to reach out to the authors in the post-session during the conference.

Addressing instabilities for few-sample BERT fine-tuning

The costs of BERT Fine-Tuning on small datasets

Fine-tuning BERT or its variants has become one of the most popular and effective methods to tackle natural language processing tasks, especially those with limited data. BERT models have been downloaded more than 5.6 million times from Hugging Face’s public server.

However, fine-tuning remains unstable, especially when using the large variant of BERT (BERTLarge) on small datasets, arguably the most impactful use of BERT-style models. Identical learning processes with different random seeds often result in significantly different and sometimes degenerate models following fine-tuning, even though only a few, seemingly insignificant aspects of the learning process are impacted by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). In layman’s terms: every time you train BERT for your task, you get different results. This means you need to train again and again to get a good system. This makes scientific comparison challenging (Dodge et al., 2020) and creates huge costs, which are potentially unnecessary.

While the variance comes from randomness, we hypothesize that the major cause of this instability lies in the optimization process.

Revisiting Few-sample BERT Fine-tuning

We conducted an extensive empirical analysis of BERT fine-tuning optimization behaviors on three aspects to identify the root cause of instability:

The Optimization Algorithm: We found that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the main cause of degenerate models.
The Initialization: We found that re-initializing the top few layers of BERT stabilizes the fine-tuning procedure.
The Number of Training Iterations: We found that the model still requires hundreds of updates to converge.

1. Optimization Algorithm

We observed that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the leading cause of degenerate fine-tuning runs. The following is pseudo-code for the Adam algorithm (Kingma & Ba, 2014). BERTAdam omits lines 9 and 10, which correct the biases in the first and second moment estimates.

Figure: Pseudo-code of the Adam algorithm. BERTAdam omits lines 9 and 10, which correct the biases in the first and second moment estimates.
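
The effect of the two bias-correction steps can be seen in a minimal scalar sketch of one Adam update. The function name, the hyperparameter defaults, and the gradient value are assumptions for illustration; this is not BERTAdam's actual implementation.

```python
import math


def adam_step(grad, m, v, step, lr=2e-5, beta1=0.9, beta2=0.999,
              eps=1e-6, correct_bias=True):
    """Return (update, m, v) for one scalar Adam step (step counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    if correct_bias:
        m_hat = m / (1 - beta1 ** step)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** step)         # bias-corrected second moment
    else:
        m_hat, v_hat = m, v                     # debiasing omitted
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v


grad = 0.5  # hypothetical gradient value
with_correction, _, _ = adam_step(grad, m=0.0, v=0.0, step=1, correct_bias=True)
without_correction, _, _ = adam_step(grad, m=0.0, v=0.0, step=1, correct_bias=False)
ratio = without_correction / with_correction
print(f"first-step update is {ratio:.2f}x larger without bias correction")
```

With these assumed defaults, the uncorrected first step is roughly three times larger than the corrected one. The two variants agree only after many steps, once `1 - beta1 ** step` and `1 - beta2 ** step` are both close to 1.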

Fine-tuning BERT with the original Adam (with bias correction) eradicates almost all degenerate model training outcomes and reduces the variance across multiple randomized trials. Here, we show the test performance distribution of 20 random trials with or without bias correction on four small datasets.

Figure: Test performance distribution of 20 random trials with and without bias correction on four small datasets.

Since the variance is significantly reduced, practitioners can easily get a decent model within only one to five trials instead of fine-tuning up to 20 models and picking the best one.

2. Initialization

We hypothesized that the top pre-trained layers of BERT are specific to the pre-training task and may not transfer to a dissimilar downstream task. We propose to re-initialize the top few layers of BERT to ease the fine-tuning procedure. We plot the training curves with and without re-initialization below, showing consistent improvement for models with re-initialized output layers.

Figure: Training curves with and without re-initialization.

The following figure shows the validation performance with different numbers of re-initialized layers. As we can see, re-initializing a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.

Figure: Validation performance with different numbers of re-initialized layers.
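
The re-initialization step can be sketched on a toy model in plain Python. Everything here is a hypothetical illustration: the model is a list of weight lists ordered from bottom to top, and the standard deviation of 0.02 is an assumed initializer scale, not a value stated in this post.

```python
import random


def reinit_top_layers(layers, num_reinit, std=0.02, seed=0):
    """Return a copy of `layers` whose last `num_reinit` layers are re-drawn.

    `layers` is a list of weight lists ordered from bottom (input) to top (output).
    """
    if not 0 <= num_reinit <= len(layers):
        raise ValueError("num_reinit must be between 0 and the number of layers")
    rng = random.Random(seed)
    keep = len(layers) - num_reinit
    new_layers = [list(w) for w in layers[:keep]]  # keep pre-trained weights
    for weights in layers[keep:]:
        # Replace the pre-trained weights with fresh random values.
        new_layers.append([rng.gauss(0.0, std) for _ in weights])
    return new_layers


# Toy "pre-trained" model: 12 layers of 4 weights each (hypothetical values).
pretrained = [[float(i)] * 4 for i in range(12)]
model = reinit_top_layers(pretrained, num_reinit=2)

print(model[9] == pretrained[9])    # lower layer keeps its pre-trained weights
print(model[11] == pretrained[11])  # top layer no longer matches
```

The number of layers to re-initialize (`num_reinit`) is left as a parameter because, as noted above, the best value depends on the downstream task.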

3. Number of Training Iterations

We also studied the conventional 3-epoch fine-tuning setup of BERT. Through extensive experiments on various datasets, we observe that the widely adopted 3-epoch setup is insufficient for few-sample datasets. Even with few training examples, the model still requires hundreds of updates to converge.
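
A quick calculation shows why three epochs can fall short on a small dataset. The dataset size and batch size below are hypothetical values chosen for illustration, and the sketch assumes one optimizer update per batch.

```python
import math


def num_updates(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer updates when each epoch sees every example once."""
    return math.ceil(num_examples / batch_size) * epochs


# Hypothetical few-sample setup: 1,000 examples, batch size 32.
updates_3_epochs = num_updates(num_examples=1000, batch_size=32, epochs=3)
print(updates_3_epochs)
```

Under these assumed values, three epochs amount to fewer than a hundred updates, short of the hundreds that the model needs to converge.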

Revisiting Existing Methods for Few-sample BERT Fine-tuning

Instability in BERT fine-tuning, especially in few-sample settings, has received significant attention recently, and several methods have been proposed to address it. We revisited these methods in light of our analysis of the fine-tuning process, focusing on the impact of using debiased Adam instead of BERTAdam.

To illustrate, the following figure shows the mean test performance and standard deviation on four datasets. “Int. Task” stands for transferring via an intermediate task (MNLI), “LLRD” stands for layerwise learning rate decay, and “WD” stands for weight decay. Numbers that are statistically significantly better than the standard setting (left column) are in blue and underlined.

Figure: Mean test performance and standard deviation on four datasets.
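
Of the methods listed, layerwise learning rate decay is the easiest to show in a few lines. A common formulation, assumed here, gives the top layer the base learning rate and multiplies it by a fixed decay factor for each layer below. The function name and all values are hypothetical.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Learning rate per layer; index 0 is the bottom layer, the last index the top."""
    return [top_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


# Hypothetical values for a 12-layer encoder.
rates = layerwise_learning_rates(num_layers=12, top_lr=2e-5, decay=0.95)
print(rates[-1], rates[0])
```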

We found that the standard fine-tuning procedure using bias-corrected Adam already has a fairly small variance, making these more complex techniques largely unnecessary. Moreover, re-initialization and training longer serve as simple yet hard-to-beat baselines that outperform previous methods, except for “Int. Task” on RTE. The reason is that RTE is very similar to MNLI (the intermediate task).

Why this work matters

This work carefully investigates the current, broadly adopted optimization practices in BERT fine-tuning. Our findings significantly stabilize BERT fine-tuning on small datasets. Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.

Our findings focus on few-sample training scenarios, which opens, or at least eases, the way for new applications at reduced data costs. The reduction in cost broadens the accessibility and reduces the energy footprint of BERT-based models. Applications that require frequent re-training are now easier and cheaper to deploy given the reduced training costs. This work also simplifies the scientific comparison between future fine-tuning methods by making training more stable, and therefore easier to reproduce.

This work has been accepted and will be published at ICLR 2021. Visit our poster during the virtual conference (Poster Session 2: May 3, 2021, 9 a.m. PDT and May 3, 2021, 11 a.m. PDT) to talk with the authors.
