seamless_communication Results of ASR are incomplete

Open ysapolovych opened this issue 1 year ago • 3 comments

My issue seems very similar to https://github.com/facebookresearch/seamless_communication/issues/83, but I am using the Translator Python API with the ASR task. My input is 30 seconds long, and only about half of it gets transcribed:

from seamless_communication.models.inference import Translator
import torch

device = torch.device('cuda:0')

translator = Translator('seamlessM4T_medium',
                        vocoder_name_or_card='vocoder_36langs',
                        device=device,
                        dtype=torch.float32)
                        
full_text, wav, out_sr = translator.predict(input='1694165663513.wav',
                                            task_str='ASR',
                                            tgt_lang='eng',
                                            src_lang='eng',
                                            sample_rate=16000,
                                            ngram_filtering=True)

I wonder whether the text_max_len_a, text_max_len_b, unit_max_len_a, and unit_max_len_b parameters of the predict method contribute to this (alas, they are undocumented). Changing them, however, had no effect.
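For anyone reading along: in fairseq-style sequence generators, a pair of parameters named like max_len_a / max_len_b usually caps the output length at a * source_length + b. Whether text_max_len_* and unit_max_len_* in predict follow that convention is an assumption here, not something confirmed in this thread. Under that assumption, the cap can be sketched as follows (max_output_len is a hypothetical helper, not part of seamless_communication):

```python
def max_output_len(src_len: int, a: int, b: int) -> int:
    """Upper bound on generated length under the ASSUMED a * src_len + b rule."""
    if src_len < 0:
        raise ValueError("src_len must be non-negative")
    return a * src_len + b


# Illustrative values only; they are not taken from the library's defaults.
for src_len in (50, 500, 1500):
    print(src_len, max_output_len(src_len, a=1, b=200))
```

If the assumption holds, any positive b already leaves room for at least b output tokens regardless of input length, so a length cap alone may not explain the truncation. That would be consistent with changing these values having no effect.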

Sep 12 '23 10:09 ysapolovych

Yes. If I try 60 seconds of audio, I get about 20 seconds transcribed. If I send 20 seconds, I get 10 seconds transcribed. If I send 10 seconds of audio, I get 5 seconds transcribed.

Sep 12 '23 11:09 casic

Anything on this?

Thank you

Sep 14 '23 11:09 BakingBrains

I'm hitting the same problem. Does anyone know what's going on?

Sep 15 '23 06:09 lixikun
