There is a lot of interest in automatic speech recognition (ASR) for many uses. Thanks to the recent development of efficient training mechanisms and the availability of more computing power, deep neural networks have enabled ASR systems to perform astoundingly well in a number of application domains.
At ASAPP, our focus is in augmenting human performance with AI. Today we do that in large consumer contact centers, where our customers serve consumers over both voice and digital channels. ASR is the backbone that enables us to augment agents in real-time throughout each customer interaction. We build the highest performing ASR system in the world based on industry standard benchmarks. We do this not only by leveraging the technological advancement in deep learning, but also by applying our own innovation to analyze problems at different levels of detail.
At ASAPP we continuously push the limits of what’s possible by not only leveraging technological advances in deep learning, but by also innovating. We are always looking for new ways to analyze problems and explore practical solutions at different levels of detail.
LibriSpeech, a speech corpus of 1,000 hours of transcribed audiobooks, has been adopted since its introduction in 2015 as the most used benchmark dataset for ASR research in both academia and industry. Using this dataset, many prestigious research groups around the world including ASAPP, Google Brain, and Facebook AI Research have been testing their new ideas. Never have there been more rapid advances than in the past year for the race to achieve better results on the LibriSpeech testset.
In early 2019, Google’s ASR system with a novel data augmentation method outperformed all previously existing ASR systems by a big margin, boasting a word error rate (WER) of 2.5% on the LibriSpeech test-clean set (shown in the below figure). A word error rate is the percentage of words an ASR system gets wrong, measured against a reference transcript of the given audio. Later the same year, ASAPP joined the race and gained immediate attention with a WER of 2.2%, beating the best performing system at that time by Facebook. The lead, however, didn’t last long as Google in 2020 announced a couple of new systems to reclaim the driver seat in the race, reaching a sub-2.0% WER for the first time. One week after Google’s announcement, ASAPP published a new paper highlighting a 1.75% WER (98.25% accuracy!) to regain the lead. ASAPP remains at the top of the leaderboard (as of September in 2020).
The race will continue, and so will our innovation to make the ultimate winner in this race our customers. Accurate transcriptions feed directly into business benefit for our customer companies, as it enables the ASAPP platform to augment agents—providing real-time predictions of what to say and do to address consumers needs, drafting call summary notes, and automating numerous micro-processes. Plus, having insights from full and accurate transcriptions gives these companies a true voice of the customer perspective to inform a range of business decisions.
At ASAPP, innovation is based on our competent research capability that enabled the aforementioned milestones. But our strength is not only in research but also in an agile engineering culture that makes rapid productization of research innovations possible. This is well exemplified by our recent launch of a multistream convolutional neural network (CNN) model to our production ASR systems.
Multistream CNN—where an input audio is processed with different resolutions for better robustness to noisy audio—is one of the main contributing factors to the successful research outcomes from the LibriSpeech race. Its structure consists of multiple streams of convolution layers, each of which is configured with a unique filter resolution for convolutions. The downside to this kind of model is the extra processing time that causes higher latency due to many future speech frames being processed during ASR decoding. Rather than leaving it as a high-performing, but not feasible-in-production research prototype, we invented a multistream CNN model suitable for real-time ASR processing by dynamically assigning compute resources during decoding, while maintaining the same accuracy level as the slower research-lab prototype. Our current production ASR systems take advantage of this optimized model, offering more reliable transcriptions even for noisy audio signals in the agent-customer conversations of contact centers.
As illustrated in Stanley Kubrick’s 1968 movie 2001: Space Odyssey, human aspiration of creating AI able to understand the way we communicate has led to significant technological advancements in many areas. Deep learning has brought recent revolutionary changes to AI research including ASR, which has taken major leaps in the last decade more so than it did in the last 30 years. The radical improvement of ASR accuracy that would make consumers embrace voice recognition products more comfortably than at any time in history are expected to open up a $30 billion market for ASR technology in the next few years.
As we’re entering an era where our own Odyssey to human-level ASR systems might reach the aspired destination soon, ASAPP as a market leader will continue to invest in rapid innovation for AI technology through balancing cutting-edge research and fine-tuned productization to enhance customer experience in meaningful ways.
Our research work in this area was presented at the Ai4 conference.
Kyu Han, PhD is Director of Speech Modeling at ASAPP, leading innovations in deep learning technologies for speech applications. Dr. Han received his PhD from USC in 2009 and held research positions at IBM Watson, Ford Research, Capio (acquired by Twilio) and JD AI Research. He serves as reviewers for IEEE, ISCA and ACL journals/conferences, and is a Speech and Language Technical Committee member in the IEEE SPS since 2019 and part of the Organizing Committee for IEEE SLT 2021. In 2018, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017.