ValueError: The input waveform must be two dimensional, but has 102528 dimension(s) instead.

Open erkaink opened this issue 5 months ago • 0 comments

Hello, I am getting the error below and can't find a solution. Does anyone have an idea of what I should do? I asked ChatGPT and tried converting the input audio file to stereo, to mono, etc., but it still didn't work. Thanks in advance.

----@---- seamless_communication % m4t_predict input/speech.mp3 --task S2ST --tgt_lang FRA --output_path /Users/username/seamless_communication/output/compl.mp3
2024-09-07 01:34:47,221 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cpu') with dtype=torch.float32.
Using the cached checkpoint of seamlessM4T_v2_large. Set force to True to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set force to True to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set force to True to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set force to True to download again.
Using the cached checkpoint of vocoder_v2. Set force to True to download again.
/opt/homebrew/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  WeightNorm.apply(module, name, dim)
2024-09-07 01:35:09,103 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2024-09-07 01:35:09,105 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2024-09-07 01:35:09,105 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2024-09-07 01:35:09,141 WARNING -- seamless_communication.inference.translator: Transposing audio tensor from (bsz, seq_len) -> (seq_len, bsz).
Traceback (most recent call last):
  File "/opt/homebrew/bin/m4t_predict", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/seamless_communication/cli/m4t/predict/predict.py", line 235, in main
    text_output, speech_output = translator.predict(
                                 ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/seamless_communication/inference/translator.py", line 293, in predict
    src = self.collate(self.convert_to_fbank(decoded_audio))["fbank"]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: The input waveform must be two dimensional, but has 102528 dimension(s) instead.

erkaink, Sep 06 '24 23:09