Continuously improving and expanding the speech to text offering
In addition to updating its speech to text models to improve accuracy, Zoom Media is continuously expanding the number of languages it supports. In April we added Filipino to the mix, which we developed in collaboration with our partner Microsoft. Now we are proud to announce that we have deployed Italian and Modern Standard Arabic (MSA). Last but not least, we updated our Norwegian speech to text model, the first of our Nordic language models, which now reaches a much higher accuracy.
Arabic, a tough nut to crack
Arabic is an especially tough nut to crack in speech recognition research, because the language has such a vast number of varieties and dialects. That is why, as a start, Zoom Media decided to focus on the Modern Standard variant, which is spoken mostly on television throughout the Arabic-speaking nations. Since Zoom Media focuses strongly on the Media & Entertainment industry, Modern Standard Arabic was the logical choice. Thanks to our unique machine learning approach, we achieved an average Word Error Rate (WER) of 15.2% for Modern Standard Arabic, with clear speech audio reaching 90%+ accuracy on the current model. The Zoom Media model for Modern Standard Arabic is also a nice addition to the offering of our partner Microsoft, which now supports Egyptian Arabic. Together, Zoom Media and Microsoft can offer more value to the Arabic Media & Entertainment market.
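For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only. It is not Zoom Media's evaluation pipeline, the function name `word_error_rate` is made up for this example, and it assumes both transcripts are already normalized (same casing, no punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One wrong word out of six gives a WER of about 16.7 percent.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on a mat"), 1))
```

Word accuracy is often quoted loosely as 100 minus WER, so an average WER of 15.2 corresponds to roughly 84.8% on average, with clear speech scoring above that average.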
Italian might not be the most exotic language when it comes to speech to text development. There are, of course, numerous companies out there offering the language, but it never hurts to give consumers the chance to try out as many options as possible. Plus, we had a good amount of training data available, so it was not much of a stretch to develop an Italian speech to text model that reaches a good accuracy on broadcast data. Currently, the model reaches an average WER of 16.1%, and on clear speech it can also reach 90%+ accuracy.
It’s not all about deploying as many new language models as possible. In our opinion, it’s better to make sure the ones you have are as good as humanly (machinely?) possible. That is why we continuously run iterations on all our models, to make sure accuracy increases and stays at a high level. We were struggling a bit with Norwegian due to a lack of training data, but we are glad to say the model has now improved from an average WER of 21.2% to 17.2%, which makes it interesting for broadcasters in Norway. One challenge remains in the differences between the country’s major dialects, but with sufficient training and feedback from our clients, we are confident we can cater to everyone’s needs very soon!