Meta MMS: what it is and how the open source models for text-to-speech and speech-to-text applications work

The company led by Mark Zuckerberg has published new open source models with up to 1 billion parameters on GitHub. They recognize up to 4,000 languages and support audio generation in over 1,000. They are well suited to developing text-to-speech and speech-to-text applications.

Meta is showing that it wants to push ever harder on innovation in generative models. After helping to boost community-driven open source projects with LLaMA (Large Language Model Meta AI), Mark Zuckerberg’s company is now shaking up the world of artificial intelligence again with a new project.

It is called Massively Multilingual Speech (MMS), and it is a model capable of recognizing over 4,000 spoken languages and generating audio through speech synthesis in over 1,100 languages. Like most of Meta’s artificial intelligence offerings, MMS is an open source tool. It aims to preserve linguistic diversity and to encourage researchers to use it to build innovative applications.

Speech recognition and synthesis models require an intensive training phase on thousands of hours of audio with associated transcription labels. Labels are essential for machine learning, allowing algorithms to correctly classify and “understand” the data.

Languages that are not widely used in industrialized nations are at risk of disappearing in the coming decades, and for these there is not enough data to train the models. Meta therefore took an unconventional approach to audio data collection, drawing on recordings of translated religious texts. “We turned to religious texts, such as the Bible, that have been translated into many different languages and whose translations have been studied extensively for text-based language translation research,” the company said. “These translations have publicly available audio recordings of people reading these texts in different languages.” By incorporating unlabeled recordings of the Bible and similar texts, Meta researchers increased the number of languages covered by the model to more than 4,000.

Although the content of the audio recordings is religious, the analyses carried out by Meta’s researchers show that this does not skew the model, which can still produce all kinds of text. Furthermore, although most of the religious recordings were read by male speakers, this did not introduce any imbalance: the synthesis engine can still adapt to and produce female voices.

On the MMS GitHub page, “Scaling Speech Technology to 1000+ languages”, you can find the pretrained models with 300 million and 1 billion parameters, optimized versions of various models, and the ISO codes of all supported languages.

Compared with OpenAI’s Whisper, the solution developed by Meta engineers exceeded expectations: models trained on MMS data achieve half the word error rate, while MMS covers 11 times as many languages as the OpenAI offering.
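To make the comparison concrete, here is a minimal sketch of how a word error rate is usually computed: the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not Meta’s evaluation code, and real benchmarks normally normalize punctuation and casing first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts from two systems for the same utterance.
reference = "the cat sat on the mat"
system_a = "the cat sat on mat"   # one word dropped
system_b = "the cat sit on mat"   # one word dropped, one substituted

wer_a = word_error_rate(reference, system_a)
wer_b = word_error_rate(reference, system_b)
print(f"system A: {wer_a:.3f}  system B: {wer_b:.3f}")
```

In this toy case system A makes one error and system B two over the same six reference words, so A’s rate is half of B’s, which is the kind of relationship the comparison above describes.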

The new MMS can be used in many applications: converting speech to written text (speech-to-text) and vice versa (text-to-speech), as well as in many other fields.

Of course, there is always a risk that the speech-to-text model might transcribe certain words or phrases incorrectly, but this is a rather common problem for AI-based systems of this kind.

Meta envisions a world where assistive technology, text-to-speech, and even virtual and augmented reality enable everyone to speak and learn in their native language. “We hope for a reality where technology can encourage people to keep their languages alive as they can access information and use every tool using their preferred language,” writes Meta, which aims to break down language barriers in an increasingly inclusive way.

For more details on MMS, you can refer to the post published on the Meta blog.
