seamless_communication
The input sequence length must be less than or equal to the maximum sequence length
I tried a 1-minute audio and it worked just fine, but when I tried a 7-minute audio, it threw this error:
The input sequence length must be less than or equal to the maximum sequence length (4096), but is 23713 instead.
Hey @parzoe, I would suggest splitting your audio into smaller chunks, since the maximum sequence length our model is designed to handle is 4096. You can force it to handle longer sequences by manually overriding max_seq_len in the model configuration, but that will very likely reduce the quality of the output, since we haven't trained our model with such long sequences.
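For the chunking route, a minimal sketch using torchaudio; the 20-second window, file names, and the example m4t_predict invocation in the comment are illustrative choices, not something prescribed in this thread:

```python
# Sketch: split a long recording into fixed-length chunks that stay
# well under the model's 4096-frame limit, then translate each chunk.
import torchaudio

CHUNK_SECONDS = 20  # arbitrary; tune so each chunk stays short enough

wav, sr = torchaudio.load("input.wav")   # wav: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)      # downmix to mono
step = CHUNK_SECONDS * sr

for i, start in enumerate(range(0, wav.shape[1], step)):
    torchaudio.save(f"chunk_{i:03d}.wav", wav[:, start : start + step], sr)
    # then run each chunk through the model as usual, e.g.:
    #   m4t_predict chunk_000.wav s2st fra --output_path out_000.wav
```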
Thanks @cbalioglu. I split the audio into 7 one-minute segments and the model worked fine, but the translation is very poor: for each one-minute segment of speech, the model produced only about 20 seconds of translation and cut a lot from the audio.
I resampled the input audio to 32-bit float little-endian, 16000 Hz, mono, and it seems to work a bit better.
You might want to do some voice activity detection (VAD) and split the audio into self-contained segments instead of splitting in the middle of a sentence.
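As a rough stand-in for a real VAD, pydub's silence splitting can do this; a sketch assuming pydub is installed (pip install pydub, with ffmpeg on the PATH), where the thresholds are guesses to tune per recording:

```python
# Sketch: split on detected silence instead of fixed offsets, so
# segments end at natural pauses rather than mid-sentence.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("input.wav")
segments = split_on_silence(
    audio,
    min_silence_len=700,  # ms of quiet that counts as a break
    silence_thresh=-40,   # dBFS below which audio counts as silence
    keep_silence=200,     # ms of padding kept around each segment
)
for i, seg in enumerate(segments):
    seg.export(f"segment_{i:03d}.wav", format="wav")
```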
I didn't find max_seq_len under Translator; could you please provide demo code for overriding it?
@florind, can you please explain how you did that resampling?
ffmpeg -i /tmp/hello.wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav
By the way, I figured out the output format after running a T2ST job, m4t_predict "Hello, world" t2st fra --src_lang eng --output_path /tmp/hello.wav, and then checking the result with aplay /tmp/hello.wav.
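If shelling out to ffmpeg is awkward, the same conversion can be done in Python; a sketch assuming torchaudio is available, mirroring the pcm_f32le / 16 kHz / mono flags above:

```python
# Sketch: Python equivalent of the ffmpeg command above.
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("/tmp/hello.wav")
wav = wav.mean(dim=0, keepdim=True)                  # downmix to mono
wav = F.resample(wav, orig_freq=sr, new_freq=16000)  # resample to 16 kHz
torchaudio.save(
    "output_resampled.wav",
    wav,
    16000,
    encoding="PCM_F",       # 32-bit float PCM, like pcm_f32le
    bits_per_sample=32,
)
```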
Thanks @florind.
Hello, I had opened a topic elsewhere and was referred here.
I'm not sure I understand.
I tried ffmpeg -i /tmp/hello.wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav with my file, but I get an error like:
File ["<ipython-input-26-39495ebabb74>"](https://localhost:8080/#), line 1 ffmpeg -i /content/drive/MyDrive/Audio Space/ICT twitter space - Knowing Your Model Will Deliver [WQO28dHgPAc].wav -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav ^ SyntaxError: invalid syntax
The arrow points at the _ in pcm_f32le.
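That SyntaxError is expected: ffmpeg is a shell command, not Python, so a Python/Colab cell cannot parse it. In Colab, prefix the command with ! and quote the path, since it contains spaces and brackets:

```
!ffmpeg -i "/content/drive/MyDrive/Audio Space/ICT twitter space - Knowing Your Model Will Deliver [WQO28dHgPAc].wav" -ar 16000 -ac 1 -c:a pcm_f32le output_resampled.wav
```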
I'm working on Google Colab and have either a video or an audio file (I can convert with ffmpeg if necessary). I'd like to work on videos lasting at least 1 hour.
Can you confirm that this solution will let me work on videos of any length? What are the characteristics and rules to respect in terms of length/duration? Can you give me a link? I can't find the answer.
But the result will be several output files, is that correct?
Do you know if the maximum length will be changed in the future?
Hello @cbalioglu, are there any plans to support long audio (>1 min) in the future? Thanks.
I simply solved it by discarding the samples with excessively long unit sequences in train_manifest.json and validation_manifest.json. For example:
```python
t2u_config = UnitYT2UConfig(
    model_dim=1024,
    unit_max_seq_len=2048,
    target_vocab_info=VocabularyInfo(
        size=10082, unk_idx=3, bos_idx=0, eos_idx=2, pad_idx=1
    ),
)
```
The t2u_config sets unit_max_seq_len to 2048, so I wrote a Python script to drop the samples whose unit length is greater than 2048.
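A sketch of that filtering step, assuming a JSON-lines manifest where each sample's unit sequence lives under target["units"]; the field names are guesses and should be adjusted to the actual manifest layout:

```python
# Sketch: drop manifest samples whose unit sequence exceeds
# unit_max_seq_len, writing the survivors to a new file.
import json

MAX_UNITS = 2048  # must match unit_max_seq_len in t2u_config

def filter_manifest(src_path: str, dst_path: str) -> None:
    kept = dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            sample = json.loads(line)
            units = sample["target"]["units"]  # assumed field name
            if len(units) <= MAX_UNITS:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"{dst_path}: kept {kept}, dropped {dropped}")

for name in ("train_manifest.json", "validation_manifest.json"):
    filter_manifest(name, name.replace(".json", ".filtered.json"))
```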