seamless_communication

Is it possible to generate subtitles like whisper ai?

Open barinov274 opened this issue 2 years ago • 6 comments

SeamlessM4T can generate transcriptions, so I assumed it stores timestamps somewhere. Is that true?

barinov274 avatar Aug 23 '23 02:08 barinov274

I would also like to know this as it would be incredibly beneficial to have this as an optional feature!

dillfrescott avatar Aug 23 '23 03:08 dillfrescott

Hey @barinov274 - Unfortunately it's not trivial to get ASR timestamps from our model. Since the model is shared with translation tasks, the decoding process is not "monotonic" as it is in other ASR approaches (e.g. CTC). Technically, the i-th generated token t_i could attend to the x-th source token s_x and t_j to s_y, with i < j but x > y.
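
To make the non-monotonic case concrete, here is a toy sketch. It is not code from seamless_communication: the attention matrix is made up, and `argmax_alignment` and `is_monotonic` are hypothetical helper names. Each row is one generated token and each column one source frame.

```python
def argmax_alignment(attn):
    """For each target token (row), return the index of its most-attended source frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]


def is_monotonic(alignment):
    """True if the attended source positions never move backwards."""
    return all(a <= b for a, b in zip(alignment, alignment[1:]))


# Made-up attention weights: 3 target tokens x 4 source frames.
attn = [
    [0.1, 0.1, 0.7, 0.1],  # t_0 attends mostly to s_2
    [0.6, 0.2, 0.1, 0.1],  # t_1 attends mostly to s_0 (earlier than s_2)
    [0.1, 0.1, 0.1, 0.7],  # t_2 attends mostly to s_3
]

alignment = argmax_alignment(attn)
print(alignment, is_monotonic(alignment))  # [2, 0, 3] False
```

Here i = 0 < j = 1 but x = 2 > y = 0, so reading timestamps straight off the attention peaks would put the second token before the first.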

cndn avatar Aug 23 '23 17:08 cndn

Hi @barinov274! Unlike Whisper, and although we both use an encoder-decoder architecture, we didn't train for ASR with timestamp tokens. Our focus is translation, and ASR is treated as S2TT (speech-to-text translation) into the same source language. As @cndn mentioned, the decoder can technically attend to the source audio in a non-monotonic fashion. That said, we can potentially leverage the encoder-decoder attention matrices to infer a monotonic alignment between the source audio and the target text and use that to output timestamps. I'll try this option, see if it's accurate, and share updates here.
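
A rough sketch of what inferring a monotonic alignment from an attention matrix could look like, using dynamic time warping (DTW). This is an illustration under assumptions, not the implementation in this repo: the matrix is made up, `dtw_path` and `token_frame_spans` are hypothetical names, the cost `1 - attention` is one choice among several, and `SECONDS_PER_FRAME` is a placeholder because the real duration of an encoder frame depends on the model's downsampling. How to extract the attention weights from the model is not shown.

```python
import math

SECONDS_PER_FRAME = 0.02  # placeholder value, not the model's real frame duration


def dtw_path(cost):
    """Cheapest monotonic path from (0, 0) to (n-1, m-1) through a cost matrix."""
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def token_frame_spans(attn):
    """Map each target token to the (first, last) source frame on the DTW path."""
    cost = [[1.0 - a for a in row] for row in attn]
    spans = {}
    for tok, frame in dtw_path(cost):
        first, last = spans.get(tok, (frame, frame))
        spans[tok] = (min(first, frame), max(last, frame))
    return [spans[t] for t in range(len(attn))]


# Made-up attention weights: 3 target tokens x 6 source frames.
attn = [
    [0.50, 0.40, 0.05, 0.03, 0.01, 0.01],
    [0.02, 0.03, 0.50, 0.40, 0.03, 0.02],
    [0.01, 0.01, 0.03, 0.05, 0.40, 0.50],
]

spans = token_frame_spans(attn)
print(spans)  # [(0, 1), (2, 3), (4, 5)]

# Convert frame spans to (start, end) times in seconds.
times = [(a * SECONDS_PER_FRAME, (b + 1) * SECONDS_PER_FRAME) for a, b in spans]
```

On real attention weights you would probably need to average or select heads and layers first, and the resulting timestamps would still need checking against reference alignments before they could be trusted.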

elbayadm avatar Aug 24 '23 16:08 elbayadm

I'll try something using this approach. For VOD content this is fine, because we can afford to wait for the result and produce an SRT (or similar) subtitle file at the end. I'm trying to understand the whole picture, though, because I'm looking for models and approaches that would make this work with live-streaming content.
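
For the VOD case, once (start, end, text) segments exist, writing the SRT file is the easy part. A minimal sketch, assuming the segments are already available from some alignment step; `format_srt_time` and `to_srt` are hypothetical helper names and the sample segments are made up:

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT cues numbered from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Made-up segments for illustration.
segments = [(0.0, 1.5, "Hello"), (1.5, 3.25, "world")]
print(to_srt(segments))
```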

In terms of the resources needed to do this, what would you recommend for a production environment?

Leeaandrob avatar Dec 15 '23 10:12 Leeaandrob

Any progress on this? I am very far removed from the Machine Learning world so I am unable to contribute, but I'm keen on using either this model or Meta MMS to generate subtitles for a low-resource language (whisper is utterly incapable of getting good results for the respective language).

jtlonsako avatar Apr 19 '24 00:04 jtlonsako