Automatic Speech Recognition
Speech recognition is an interdisciplinary branch of computer science and computational linguistics that develops approaches and technologies that allow computers to recognize spoken language and transcribe it into text, which can then be searched. It is also referred to as speech to text (STT), computer speech recognition, or automatic speech recognition (ASR). It draws on expertise and research from the domains of computer science, linguistics, and computer engineering. Speech synthesis, also called text to speech (TTS), is the reverse process.
A typical text-to-speech model is made up of two parts: one that converts text into an intermediate speech representation (a mel-spectrogram), and another that converts the generated mel-spectrogram into audio. Over the past several years, there has been remarkable progress toward generating more natural and authentic speech using deep neural networks. Our goal is not only to produce highly realistic speech but also to take the next step by conducting research in fields including, but not limited to, style transfer, voice conversion, and multilingual TTS.
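The two-part structure described above can be sketched as a pair of interchangeable stages. This is a minimal illustration, not any particular system: the names `synthesize`, `toy_acoustic_model`, and `toy_vocoder`, and the values `N_MELS = 80` and `HOP_LENGTH = 256`, are assumptions chosen for the example, and the toy stages output zeros rather than real speech. The mel-scale helpers use the common HTK-style formula, one of several conventions in use.

```python
import math
from typing import Callable, List

MelSpectrogram = List[List[float]]  # one row per frame, one column per mel band
Waveform = List[float]              # raw audio samples

N_MELS = 80       # assumed number of mel bands per frame
HOP_LENGTH = 256  # assumed number of audio samples per spectrogram frame


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def toy_acoustic_model(text: str) -> MelSpectrogram:
    """Stage 1 stand-in: one all-zero frame per input character."""
    return [[0.0] * N_MELS for _ in text]


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    """Stage 2 stand-in: HOP_LENGTH silent samples per frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(
    text: str,
    acoustic_model: Callable[[str], MelSpectrogram],
    vocoder: Callable[[MelSpectrogram], Waveform],
) -> Waveform:
    """Run the two stages in sequence: text -> mel-spectrogram -> audio."""
    mel = acoustic_model(text)
    return vocoder(mel)


audio = synthesize("hello", toy_acoustic_model, toy_vocoder)
centers = mel_band_centers(N_MELS, 0.0, 8000.0)
```

Because the two stages communicate only through the mel-spectrogram, either one can be replaced independently, for example by swapping in a different vocoder while keeping the same acoustic model.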