
Support for Facebook's new SeamlessM4T (Multilingual + Multimodal)

Open Infinitay opened this issue 2 years ago • 4 comments

Facebook just released a new multimodal model for multiple languages. I would assume it's the successor to NLLB. One model to rule them all. It would be amazing to have CT2 support for this to further reduce the size of the large model. If I remember correctly, when I used Whisper large and NLLB-200 medium, I was using about 9-10 GB of VRAM with what should be under 3B parameters. Switching to CT2's whisper large-v2 and NLLB-200 medium (both float16) took me down to 5-6 GB of VRAM. I'm hoping that with CT2 support for SeamlessM4T we can see similar improvements with negligible loss of accuracy, all while maintaining solid multimodal metrics. That being said, if SM4T support lands in the future, would you be so kind as to include metrics comparing vanilla SM4T and CT2's SM4T for as many tasks (e.g. S2TT, T2TT, etc.) as possible? If not, maybe a script so we can analyze it ourselves?
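For context, the float16 setup described above can be reproduced roughly like this with CT2's existing converter (a sketch; the exact NLLB checkpoint is an assumption, since "NLLB-200 medium" could map to `nllb-200-distilled-600M` or `nllb-200-1.3B`):

```shell
# Convert Whisper large-v2 to a CTranslate2 model with float16 weights
ct2-transformers-converter --model openai/whisper-large-v2 \
    --output_dir whisper-large-v2-ct2 \
    --quantization float16

# Same for an NLLB-200 checkpoint (checkpoint name is an assumption)
ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
    --output_dir nllb-200-ct2 \
    --quantization float16
```

The hope is that a future SeamlessM4T conversion path would look similar, with one converted model replacing both of the above.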

Thanks, hopefully it's not too much of an ask to add support in the future, and that other people can take advantage of this too.


Website: https://ai.meta.com/resources/models-and-libraries/seamless-communication/
Code: https://github.com/facebookresearch/seamless_communication
Paper: https://ai.meta.com/research/publications/seamless-m4t/
Blog Post: https://ai.meta.com/blog/seamless-m4t/

Some Metrics

(screenshots of metric tables from the paper)

Infinitay avatar Aug 23 '23 19:08 Infinitay

The speech-to-speech translation of this model is pretty good; there's an online demo here: https://seamless.metademolab.com/

hobodrifterdavid avatar Aug 30 '23 04:08 hobodrifterdavid

The demo looks nice overall, but here is a first simple test:

(screenshot of a translation test from the demo)

ASR is 100% good. Translation is wrong, whereas DeepL and Google Translate are 100% accurate. TTS is good, but it reads out the wrong translation.

The issue with Meta's models (it was already the case with NLLB) is that the research goal is really useful, but when the result is not SOTA and produces glitches like this, in the end you are reluctant to use them. If they were open-sourcing 100%, the community could contribute to improving their work.

Don't get me wrong, the work remains impressive.

vince62s avatar Sep 01 '23 13:09 vince62s

In the paper they compare it to a cascaded approach (ASR, then translation, then TTS). I didn't look at it in detail, but the nice thing here is that it's all one model, easy to deploy, for 35 or so languages.

For TTS, outside of a handful of languages, it's hard to find decent-sounding models (comparable to, say, the Microsoft APIs). Seamless seems to do pretty well in terms of 'technical' quality, but the specific voices it's fine-tuned on could have been better. It sounds like they used LJSpeech for English; they could have used our Jenny dataset instead (https://youtu.be/JZWeYbtCisk?si=xfP-Km3ZFGRI7ZTZ&t=239). :D

hobodrifterdavid avatar Sep 02 '23 16:09 hobodrifterdavid