The input sequence length must be less than or equal to the maximum sequence length

Open parzoe opened this issue 1 year ago • 11 comments

I tried a 1-minute audio file and it worked just fine, but when I tried a 7-minute audio file it threw this error:

The input sequence length must be less than or equal to the maximum sequence length (4096), but is 23713 instead.

parzoe avatar Aug 24 '23 14:08 parzoe

Hey @parzoe, I would suggest splitting your audio into smaller chunks since the maximum sequence length our model is designed to handle is 4096. You can force it to handle longer sequences by manually overriding max_seq_len in the model configuration, but that will very likely reduce the quality of output since we haven't trained our model with such long sequences.

cbalioglu avatar Aug 24 '23 15:08 cbalioglu
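As a rough illustration of the chunking suggested above: the error message reports 23713 positions for roughly 7 minutes of audio, which hints at how many seconds of audio fit under the 4096 limit. This is a minimal sketch, assuming the sequence length grows linearly with duration and that the file was exactly 420 seconds long (both are assumptions, and the function names are hypothetical, not part of seamless_communication):

```python
import math


def estimate_max_seconds(observed_len, observed_seconds, max_seq_len=4096):
    """Estimate how many seconds of audio fit in max_seq_len positions.

    Assumes sequence length is proportional to audio duration.
    """
    positions_per_second = observed_len / observed_seconds
    return max_seq_len / positions_per_second


def chunk_bounds(total_seconds, chunk_seconds):
    """Return (start, end) times in seconds for fixed-length chunks."""
    n_chunks = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min((i + 1) * chunk_seconds, total_seconds))
        for i in range(n_chunks)
    ]


# Values taken from the error message; 420 s is an assumed duration.
limit = estimate_max_seconds(23713, 420)
bounds = chunk_bounds(420, 60)
```

Under those assumptions the limit comes out at a little over 72 seconds, so one-minute chunks leave some headroom. Fixed-length cuts can still land in the middle of a sentence.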

Thanks @cbalioglu. I split the audio into 7 segments of one minute each and the model ran fine, but the translation is very poor. Each segment contains one minute of speech, yet the model's translation of it lasts only about 20 seconds and drops much of the audio's content.

parzoe avatar Aug 24 '23 15:08 parzoe

I resampled the input audio to float 32 bit Little Endian, Rate 16000 Hz, Mono and it seems to work a bit better.

florind avatar Aug 24 '23 16:08 florind

You might want to run voice activity detection (VAD) and split the audio into self-contained segments, rather than risk splitting in the middle of a sentence.

Mortimerp9 avatar Aug 25 '23 08:08 Mortimerp9
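To make the idea of cutting at pauses concrete, here is a minimal sketch that uses a crude energy threshold as a stand-in for a real VAD model. It is not what the seamless_communication authors use; the function names, frame length and threshold are all hypothetical, and the samples are assumed to be mono floats. Each segment is capped at max_frames frames, and the cut is moved back to the most recent quiet frame when one exists:

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def split_at_silence(samples, frame_len, threshold, max_frames):
    """Return (start, end) sample indices for each segment.

    A segment never exceeds max_frames frames. When the cap is hit,
    the cut is placed at the last quiet frame seen in the segment,
    or at the cap itself if no quiet frame was found.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = 0           # first frame of the current segment
    last_silent = None  # most recent quiet frame after `start`
    for i, energy in enumerate(energies):
        if energy < threshold and i > start:
            last_silent = i
        if i - start + 1 >= max_frames:
            cut = last_silent if last_silent is not None else i + 1
            segments.append((start * frame_len, cut * frame_len))
            start = cut
            last_silent = None
    if start * frame_len < len(samples):
        # Remaining audio, including any trailing partial frame.
        segments.append((start * frame_len, len(samples)))
    return segments
```

A dedicated VAD model will usually do better on noisy recordings. The point of the sketch is only that the cut points follow pauses instead of a fixed clock.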

> Hey @parzoe, I would suggest splitting your audio into smaller chunks since the maximum sequence length our model is designed to handle is 4096. You can force it to handle longer sequences by manually overriding max_seq_len in the model configuration, but that will very likely reduce the quality of output since we haven't trained our model with such long sequences.

I couldn't find max_seq_len under Translator. Could you please provide some demo code?

kk3dmax avatar Aug 25 '23 18:08 kk3dmax

> I resampled the input audio to float 32 bit Little Endian, Rate 16000 Hz, Mono and it seems to work a bit better.

Can you please explain how you did that?

parzoe avatar Aug 26 '23 08:08 parzoe

ffmpeg -i /tmp/hello.wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav

Btw, I figured out the output format by running a T2ST prediction, m4t_predict "Hello, world" t2st fra --src_lang eng --output_path /tmp/hello.wav, and then checking the format of the resulting file with aplay /tmp/hello.wav

florind avatar Aug 26 '23 08:08 florind

Thanks @florind.

parzoe avatar Aug 26 '23 11:08 parzoe

Hello, I had opened a topic for which I was referred here.

I'm not sure I understand. I tried ffmpeg -i /tmp/hello.wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav with my own file, but I get an error like this:

File ["<ipython-input-26-39495ebabb74>"](https://localhost:8080/#), line 1
ffmpeg -i /content/drive/MyDrive/Audio Space/ICT twitter space - Knowing Your Model Will Deliver [WQO28dHgPAc].wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav
^
SyntaxError: invalid syntax

The arrow points to the _ of pcm_f32le.

I'm working on Google Colab and have either a video or an audio file (I can convert with ffmpeg if necessary). I'd like to work on videos lasting at least 1 hour. Can you confirm that this solution will let me work on videos of any length? What are the rules to respect in terms of length/time? Can you give me a link? I can't find the answer. Will the result be several output files? Do you know if the maximum length will be changed in the future?

BlockSats avatar Sep 01 '23 16:09 BlockSats
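On the SyntaxError above: the traceback comes from ipython-input, which means the ffmpeg line was parsed as Python. ffmpeg is a shell command, so in a notebook it is normally run with a leading ! and with the path quoted, or launched from Python. The following is a minimal sketch of the second option, assuming ffmpeg is installed and on PATH; the helper names are hypothetical, and the flags are the ones from the command earlier in this thread. Passing the arguments as a list means a path containing spaces or brackets needs no extra quoting:

```python
import subprocess


def build_ffmpeg_cmd(src, dst):
    """Build the resampling command: 16 kHz, mono, 32-bit float PCM."""
    return [
        "ffmpeg",
        "-i", src,           # input file; spaces in the path are fine here
        "-ar", "16000",      # sample rate in Hz
        "-ac", "1",          # one channel (mono)
        "-c:a", "pcm_f32le", # 32-bit float, little endian
        dst,
    ]


def resample(src, dst):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Calling resample with the Drive path from the error message and an output name such as output_resampled.wav would perform the same conversion. This only fixes the input format; it does not lift the sequence length limit, so long recordings still need to be split.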

> Hey @parzoe, I would suggest splitting your audio into smaller chunks since the maximum sequence length our model is designed to handle is 4096. You can force it to handle longer sequences by manually overriding max_seq_len in the model configuration, but that will very likely reduce the quality of output since we haven't trained our model with such long sequences.

Hello @cbalioglu, are there any plans to support long audio (>1 min) in the future? Thanks.

zhhl9101 avatar Sep 04 '23 06:09 zhhl9101

I solved it simply by discarding the samples with excessively long unit sequences from train_manifest.json and validation_manifest.json.

For example:

t2u_config=UnitYT2UConfig(model_dim=1024, unit_max_seq_len=2048, target_vocab_info=VocabularyInfo(size=10082, unk_idx=3, bos_idx=0, eos_idx=2, pad_idx=1)

Here t2u_config sets unit_max_seq_len to 2048, so I wrote a Python script to drop the samples whose unit sequence is longer than 2048.

iohub avatar Jan 25 '24 01:01 iohub
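The script itself was not posted, so the following is only a sketch of what such a filter might look like. It assumes the manifest holds one JSON object per line and that the unit sequence lives under target.units; both the layout and the function names are assumptions, so check them against your own manifest before relying on this:

```python
import json


def keep_sample(sample, max_units=2048):
    """True if the sample's unit sequence fits within max_units.

    Assumes the units are stored under sample["target"]["units"].
    """
    units = sample.get("target", {}).get("units") or []
    return len(units) <= max_units


def filter_manifest_lines(lines, max_units=2048):
    """Return the manifest lines whose samples pass keep_sample."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if keep_sample(json.loads(line), max_units):
            kept.append(line)
    return kept


def filter_manifest_file(src_path, dst_path, max_units=2048):
    """Write a filtered copy of the manifest and return the kept count."""
    with open(src_path, encoding="utf-8") as src:
        kept = filter_manifest_lines(src, max_units)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(kept) + ("\n" if kept else ""))
    return len(kept)
```

Writing to a new file keeps the original manifest intact, which makes it easy to count how many samples were dropped. If the training code adds special tokens to each unit sequence, a slightly lower max_units may be needed.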