Automatic Speech Recognition (ASR) has been a cornerstone capability for voice contact centers for years, enabling agents to review what was just said or to revisit older calls for context, and powering a whole suite of quality assurance and analytics capabilities. Because ASAPP specializes in serving large enterprise customers with vast amounts of data, we’re always looking for ways to improve the scalability and performance of our speech-to-text models; even small wins in accuracy can translate into huge gains for our customers. Accordingly, we recently made a strategic switch from a hybrid ASR architecture to a more powerful end-to-end neural model. Since adopting this new model, we have reduced the median latency of our model by over 50%, increased its accuracy, and lowered the cost of running it.
To understand why we made this strategic technological shift, and how we achieved these results, it helps to understand the status quo in real-time transcription for contact centers. Typically, a hybrid model is used, combining separate but complementary components. The first is an acoustic model that translates the raw audio signal into phonemes, the basic units of human speech. The audio data alone can’t be used to construct a sentence, however, because phonemes can be combined in many different ways to form words. To resolve these ambiguities, a lexicon maps phonemes to candidate words, and a third component, a language model, picks the most likely phrase or sentence from the candidates. This pipeline of separate components has been used for decades.
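To make that division of labor concrete, here is a toy sketch of the final decoding step. The lexicon, words, and scores below are hypothetical illustrations, not ASAPP's production system: the lexicon maps a phoneme sequence to homophone candidates, and the language model breaks the tie.

```python
# Toy hybrid-ASR decoding step: lexicon lookup + language model tie-break.
# The lexicon entries and scores are made up for illustration.

# Lexicon: maps candidate words to their phoneme sequences.
LEXICON = {
    "write": ["R", "AY", "T"],
    "right": ["R", "AY", "T"],
    "rite":  ["R", "AY", "T"],
}

# Stand-in "language model": unigram log-probabilities per word.
LM_LOGPROB = {"write": -1.2, "right": -0.5, "rite": -3.0}

def decode(phonemes):
    """Map phonemes to candidate words via the lexicon, then let the
    language model pick the most likely homophone."""
    candidates = [w for w, p in LEXICON.items() if p == phonemes]
    if not candidates:
        return None
    return max(candidates, key=lambda w: LM_LOGPROB[w])

print(decode(["R", "AY", "T"]))  # prints "right": all three words match,
                                 # so the language model decides
```

The key point is that the acoustic model's output (phonemes) is ambiguous on its own; the lexicon and language model exist to resolve that ambiguity, and each component is built and tuned separately.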
While hybrid architectures have been the standard, they have their limitations. First, because the acoustic model is trained separately from the language model, the combination is not as powerful as a single larger model. In our new end-to-end architecture, the encoder passes a much richer representation to the decoder than a sequence of phonemes; moreover, the pieces of the architecture are all trained together, so they learn to work well with one another.
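As a rough illustration of why the encoder's output is richer than phonemes, consider this minimal NumPy sketch. The shapes and layers are assumptions for the example, not ASAPP's actual architecture: every audio frame becomes a dense continuous vector rather than a single discrete phoneme label, and because encoder and decoder weights live in one model, a single loss can update both jointly.

```python
import numpy as np

# Hypothetical dimensions: 50 audio frames, 80 filterbank features,
# 32 hidden units, 100-word output vocabulary.
T, F, H, V = 50, 80, 32, 100

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(F, H)) * 0.1  # encoder weights
W_dec = rng.normal(size=(H, V)) * 0.1  # decoder weights

audio_features = rng.normal(size=(T, F))

# Encoder: each frame becomes a 32-dimensional continuous vector --
# far more information than one phoneme ID out of ~40 choices.
encoded = np.tanh(audio_features @ W_enc)  # shape (T, H)

# Decoder: consumes those rich vectors directly to score words.
logits = encoded @ W_dec                   # shape (T, V)
print(encoded.shape, logits.shape)
```

In a hybrid system, the information between these two stages would be collapsed to discrete phoneme labels; here the bottleneck is gone, and one gradient flows through both weight matrices during training.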
The separation of components in the legacy architecture imposes another constraint: it yields diminishing returns as more data is added. In contrast, our new integrated architecture requires more data to start, but it continues to improve markedly as we train it on new data. In other words, the new model is better able to take advantage of the large amounts of data we encounter working with enterprise customers. Some of this data is text without audio, or vice versa, and leveraging it allows us to further boost model performance without expensive transcription annotation by humans. It’s worth noting that the power of modern GPUs has catalyzed the success of these new techniques, enabling these larger, jointly trained models to train on larger datasets in reasonable amounts of time.
Once trained, we can tally up the metrics and see improvements across the board: the training process is simpler and easier to scale, and the model costs half as much to run and transcribes twice as fast*. The model also balances real-time demands with historical accuracy. It waits a few milliseconds to consider audio slightly into the future, giving it more context to predict the right words in virtually real time; a rescoring component then uses a larger window of audio to commit an even more accurate transcription to the historical record. Both our real-time and historical transcription capabilities advance the state of the art.
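The two-pass strategy can be sketched as follows. The lookahead, window size, and stand-in scoring functions are illustrative assumptions, not our actual parameters: a first pass emits provisional output after a short lookahead, and a second pass rescores a larger window before committing.

```python
# Toy sketch of two-pass streaming transcription. Lookahead and window
# sizes are made up for illustration.

def stream_decode(frames, first_pass, rescore, lookahead=2, window=8):
    """First pass: emit provisional output once `lookahead` future frames
    are available. Second pass: rescore each full `window` of frames and
    commit the result to the historical record."""
    provisional, committed = [], []
    for i in range(len(frames)):
        if i >= lookahead:
            # Low-latency provisional emit with a little future context.
            provisional.append(first_pass(frames[i - lookahead : i + 1]))
        if (i + 1) % window == 0:
            # Higher-accuracy rescoring pass over a larger window.
            committed.append(rescore(frames[i + 1 - window : i + 1]))
    return provisional, committed

# Stand-in "models" that just report how much context each pass saw.
frames = list(range(16))
prov, comm = stream_decode(frames, first_pass=len, rescore=len)
print(len(prov), comm)  # prints "14 [8, 8]": 14 provisional emits with
                        # 3 frames of context each; 2 commits, 8 frames each
```

The trade-off is explicit in the two parameters: a small lookahead keeps the provisional transcript nearly real time, while the wider rescoring window buys accuracy for the permanent record.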
ASAPP E2E Performance By the Numbers
This was not an easy task. ASAPP has a world-class team that continuously looks for ways to improve our speech capabilities. The confluence of GPU power, better modeling techniques, and bigger datasets reduces the need for an external language model and enables us to train the whole system end to end. These improvements translate into better, faster models that our customers can leverage for their speech transcription needs.