Automatic Speech Recognition (ASR) software is not new. It has been around since the fifties, but over the last couple of years several innovations have made it possible to develop high-quality speech-to-text engines in a (relatively) short amount of time.
Because of the exponential growth in the streams of content we need to process, store and re-use, people need accurate software that can recognise speakers and handle complex vocabulary, including customised terminology.
Data is crucial to the success of any machine learning project, and A.I. is no different. By developing and implementing the right filters and best practices, we are able to train our software to a high level of accuracy (at least 90 percent) in less than three months.
We'd like to share our methods with you, because we believe that sharing this knowledge with our (future) clients helps both parties understand the potential of this technology. Generally speaking, you need good data to create a good model: the quality of the data is a key factor for success. But what defines the quality of data? It depends on the goal, the client and the final use of the speech-to-text service. Still, we've put together a list of requirements for the datasets we use.
1 Quantity: We need at least one hour of audio with different people speaking in various settings, along with a handwritten, 100 percent correct transcript of the audio file. This way we can start training variations right away.
2 Quality: Clean, clear audio with correct transcripts is the best start to the development process. But most of the time, a dataset like this is hard to find. In that case, we use automated 'data cleaning' to filter the dataset. We avoid manual labour for simple tasks and reserve it for the nuances and corrections that machines miss. We've developed these best practices by continuously experimenting with which tasks should be done manually and which tasks we can automate.
3 Domain-specific instructions: We provide a custom-made speech-to-text service, so we need detailed information about which kinds of language, names and terms matter to you. It also depends on the kind of audio you want transcribed: meetings, videos and conference calls, for example, all have different (technical) settings. The more information we have, the better we can filter and correct the dataset to create well-matching training data and deliver the best custom-made speech-to-text engine possible.
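To make the automated 'data cleaning' idea above concrete, here is a minimal sketch of one common filtering approach: compare each reference transcript against the output of a baseline recogniser and route segments with a high word error rate (WER) to human review instead of fixing everything by hand. The function names and the 20 percent WER threshold are illustrative assumptions, not the actual production pipeline.

```python
# Hypothetical sketch of automated data cleaning for ASR training data.
# Segments whose WER against a baseline recogniser's hypothesis exceeds a
# threshold are flagged for manual review; the rest can be used directly.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def filter_segments(segments, max_wer=0.2):
    """Split segments into 'clean' (train directly) and 'review' (needs a human)."""
    clean, review = [], []
    for seg in segments:
        wer = word_error_rate(seg["reference"], seg["hypothesis"])
        (clean if wer <= max_wer else review).append(seg)
    return clean, review


segments = [
    {"reference": "the meeting starts at nine",
     "hypothesis": "the meeting starts at nine"},
    {"reference": "the defendant pleaded not guilty",
     "hypothesis": "the defendant played guilty"},
]
clean, review = filter_segments(segments)
```

In this toy run the first segment matches exactly and is kept, while the second (one substitution plus one deletion over five reference words, a WER of 0.4) is routed to manual review, which mirrors the "machines filter, humans add nuance" division of labour described above.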
Over the last six months, we've noticed an increasing number of people understanding the possibilities and value of this technology and suggesting various use cases for speech-to-text: from broadcasters to government agencies, from simple audio files to meetings with complex language use, such as those of lawyers and doctors. The possibilities are endless, but to really make a difference and create a high-quality speech-to-text engine, customization remains key.