13.04.21
Alexa, Siri and the rest have made speaking to computers a natural thing to do. The big tech firms have spent lots of money creating automatic speech recognition (ASR) models that convert speech to text.
Historically this has been a hard problem, needing specialised domain knowledge and large amounts of labelled data. However, the latest models achieve state-of-the-art performance with much smaller volumes of labelled data by first doing unsupervised pre-training on large unlabelled datasets.
Now the marvellous people at Hugging Face have made the best performing of these models, Wav2Vec2 (wav2vec 2.0), available as part of their open source Transformers library. Hopefully this will make it easier for practitioners to build audio models, using the pre-trained model as a starting point.
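To give a feel for what "available in Transformers" means in practice, here is a minimal sketch of transcribing audio with a pre-trained Wav2Vec2 model. It is an illustration under assumptions, not a recommended setup: the `transcribe` helper is a hypothetical name, the `facebook/wav2vec2-base-960h` checkpoint is just one publicly hosted English model, and the code assumes `transformers` and `torch` are installed and that the audio is mono at 16 kHz.

```python
# Sketch only: assumes `pip install transformers torch` and network access
# to download the checkpoint the first time it is used.

# Assumption: one publicly hosted English checkpoint; swap in another if needed.
DEFAULT_CHECKPOINT = "facebook/wav2vec2-base-960h"


def transcribe(waveform, sampling_rate=16000, model_name=DEFAULT_CHECKPOINT):
    """Return the transcript for a mono waveform (1-D sequence of floats).

    Hypothetical helper for illustration; it reloads the model on every call,
    which is simple but slow.
    """
    # Imported lazily so that defining the function does not need the heavy libraries.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # The processor normalises the raw audio and maps predicted ids back to text.
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(inputs.input_values).logits

    # Greedy decoding: take the most likely token at each time step.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

The samples passed to `transcribe` can come from any audio loader that returns a float array; if the recording is not at 16 kHz it would need resampling first.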
Why this matters: speech is the big new way of interacting with computers. An open source, pre-trained model opens speech recognition development to everyone, not just the big tech firms.