Why would you develop models in-house if you can just use the ones Google, IBM and others offer at a lower price? It’s a fair question, and one I get asked quite a lot, since we’ve been developing language models ourselves at Zoom Media for quite some time now. The answer is simple: quality control.
One size doesn’t always fit all
Although it’s true that several companies offer language models that can reach high accuracy, at Zoom Media we believe customizing these models is paramount to offering our end users the best possible solution. Speech to text simply isn’t as generic as, say, face or logo detection. Languages vary so much at a regional level, let alone across borders, that a tailor-made solution is what will offer the most value.
Take the Dutch language, for example. The Netherlands is made up of 12 provinces, but there are approximately 24 dialects. That’s not a small number for a country with a population of around 17 million people. It’s safe to say that a generic Dutch model does not offer the same results for all those dialects. After all, a language model is made up of a vocabulary and an acoustic model that recognise specific words and vocal sounds. If these don’t match the words and vocal sounds characteristic of a specific dialect, the results won’t make much sense.
Customization to the rescue
That’s why at Zoom Media we always check with a potential customer which specific use case they have. In some cases, our generic language model will do the trick, but in other cases, we need to work together to customize and train a model that better fits the specific needs of end users. Now, the big tech guys out there also offer some form of customization, but the major difference is that at Zoom Media we train a whole new model on data from the client. With the customization other vendors offer, additional client data is added to the generic model. This approach can help somewhat in some cases, but it also increases the chance of the model making more mistakes. After all, the more words and vocal sounds that are added to the vocabulary and acoustic model, the bigger the chance the model makes the wrong decision, or rather the wrong probability calculation, because that’s basically what these models do.
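To make that last point a little more concrete, here is a minimal Python sketch. It is a toy illustration, not how our models or any vendor’s models actually work: the function name, the example words and the match scores are all made up for the example. It assumes the model turns a match score per candidate word into a probability (a softmax), and shows that the probability left for the correct word shrinks as more similar-sounding candidates are added to the vocabulary.

```python
import math


def posterior(scores, target):
    """Softmax probability of `target`, given a match score per candidate word."""
    total = sum(math.exp(score) for score in scores.values())
    return math.exp(scores[target]) / total


# Hypothetical match scores for one spoken word (higher = closer match).
# The speaker said "huis"; these numbers are invented for illustration.
client_vocab = {"huis": 2.0, "muis": 1.0}

# Extra similar-sounding words that come along with a larger, merged vocabulary.
generic_extra = {"thuis": 1.8, "buis": 1.5, "kuis": 1.2}
merged_vocab = {**client_vocab, **generic_extra}

p_small = posterior(client_vocab, "huis")
p_large = posterior(merged_vocab, "huis")

print(f"client-only vocabulary: P(huis) = {p_small:.2f}")
print(f"merged vocabulary:      P(huis) = {p_large:.2f}")
```

In this toy setup the correct word still has the highest score in both cases, but with the merged vocabulary it holds a much smaller share of the probability, so a little noise in the audio is more likely to tip the decision to a neighbour.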
Recently we organized a get-together with our partner Microsoft and a couple of Dutch regional broadcasters. We took the first steps towards developing customized language models for the different regions, and we’re really excited to share the results when the time comes. Stay tuned!
Want to know more about the customization and tailor-made speech-to-text models Zoom Media has to offer? Get in touch by sending me an email at firstname.lastname@example.org. Talk to you soon!