seamless_communication
S2TT wrong results after SPEECH_TO_TEXT finetuning
Hi everyone,
We attempted to fine-tune the SPEECH_TO_TEXT component using our domain-specific data. However, when we used the fine-tuned English ASR model for English-to-Arabic translation, it consistently generated English text instead of Arabic. Our training steps were as follows:
- Generating our domain-specific data. Here is an example:

  {
    "source": {
      "id": "004NpHoq0sw_23863680-23893280",
      "lang": "eng",
      "text": "joe and pronto couldn't compete",
      "audio_local_path": "/path/004NpHoq0sw_23863680-23893280.wav",
      "waveform": null,
      "sampling_rate": 16000,
      "units": null
    },
    "target": {
      "id": "004NpHoq0sw_23863680-23893280",
      "lang": "eng",
      "text": "joe and pronto couldn't compete",
      "audio_local_path": "/path/004NpHoq0sw_23863680-23893280.wav",
      "waveform": null,
      "sampling_rate": 16000,
      "units": null
    }
  }
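Entries in this format can be generated with a short script like the one below. This is a minimal sketch: the field names are copied from the example above, and the IDs and paths are placeholders you would fill from your own data.

```python
import json


def make_sample(sample_id, lang, text, audio_path, sampling_rate=16000):
    """Build one side (source or target) of a manifest entry."""
    return {
        "id": sample_id,
        "lang": lang,
        "text": text,
        "audio_local_path": audio_path,
        "waveform": None,
        "sampling_rate": sampling_rate,
        "units": None,
    }


def write_manifest(rows, out_path):
    """Write one JSON object per line, matching the format shown above.

    For an ASR-style manifest the source and target sides are identical.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for sample_id, lang, text, audio_path in rows:
            side = make_sample(sample_id, lang, text, audio_path)
            entry = {"source": side, "target": side}
            f.write(json.dumps(entry) + "\n")
```

For English-to-Arabic S2TT data, the target side would instead carry "lang": "arb" and the Arabic text.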
- Fine-tuning the model. The data prepared in the first step was stored in train2.json:

  torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc-per-node=8 \
    --no-python \
    m4t_finetune \
    --mode SPEECH_TO_TEXT \
    --train_dataset train_data/train2.json \
    --eval_dataset train_data/test.json \
    --learning_rate 1e-6 \
    --warmup_steps 1000 \
    --eval_steps 5000 \
    --max_epochs 10 \
    --patience 100 \
    --batch_size 1 \
    --model_name seamlessM4T_medium \
    --save_model_to train_data/checkpoint.pt
- S2TT decoding:

  python predict.py test.wav s2tt arb --src_lang eng --output_path output/ --model_name seamlessM4T_medium
The code was revised to load the fine-tuned model:

  translator = Translator(args.model_name, args.vocoder_name, device, dtype)
  translator.model.load_state_dict(torch.load("train_data/checkpoint.pt"))
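One thing worth ruling out when loading: some training scripts save the raw state dict, while others wrap it in a larger dict (for example under a "model" key, alongside optimizer state), in which case load_state_dict on the raw torch.load result fails with missing/unexpected-key errors. A small defensive helper, as a sketch; the "model" key here is an assumption, not something confirmed for m4t_finetune's checkpoint format:

```python
def unwrap_state_dict(ckpt):
    """Return the model state dict whether the checkpoint is wrapped or raw.

    Assumes a wrapped checkpoint keeps its weights under a "model" key;
    adjust the key to whatever your trainer actually writes.
    """
    if isinstance(ckpt, dict) and isinstance(ckpt.get("model"), dict):
        return ckpt["model"]
    return ckpt
```

Usage would then be: translator.model.load_state_dict(unwrap_state_dict(torch.load("train_data/checkpoint.pt"))). Printing the top-level keys of the loaded checkpoint is a quick way to see which case you are in.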
We expected the decoding process to produce Arabic text, but it consistently generated English, which is quite puzzling. Could you please advise whether this issue is related to data preparation or the model training itself?
@woqiang0515 Is it intentional to fine-tune the model on an English ASR dataset and then evaluate it on English-to-Arabic translation? I would suggest fine-tuning on an English-to-Arabic S2TT dataset if the goal is to improve translation performance. In this specific case I suspect the model is forgetting how to translate because it is being fine-tuned only on the ASR task. I would also suggest verifying on the ASR task itself to ensure that the fine-tuning is working as expected.
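For that ASR verification, the usual check is word error rate between the reference transcripts and the fine-tuned model's output on a held-out set. A minimal pure-Python WER, as a sketch; real evaluations would normalize casing and punctuation first, or use an existing package such as jiwer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

If WER on in-domain audio drops after fine-tuning, the ASR fine-tuning itself is working, and any remaining problem is on the translation side.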
Thank you very much for your response. We thought the tasks were independent, but it seems that fine-tuning the English ASR alone did not significantly improve the translation performance from English to Arabic. We will re-finetune using translation data specifically for English to Arabic. Thanks again for your reply.
I have another question: if the ASR and S2TT tasks are not independent, how can we fine-tune on in-domain ASR data and domain-specific translation data at the same time? Should we fine-tune them jointly or step by step?
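One common approach to the joint question is to fine-tune on a single mixed manifest that interleaves in-domain ASR entries (target lang equal to source lang) with English-to-Arabic S2TT entries (target lang "arb"), so the model sees both tasks throughout training. A sketch of the mixing step, assuming both inputs use the JSON-lines manifest format shown earlier; the file names are placeholders:

```python
import random


def mix_manifests(asr_path, s2tt_path, out_path, seed=0):
    """Shuffle ASR and S2TT manifest lines together into one training file."""
    lines = []
    for path in (asr_path, s2tt_path):
        with open(path, encoding="utf-8") as f:
            # Normalize line endings and drop blank lines.
            lines += [ln.strip() + "\n" for ln in f if ln.strip()]
    # Deterministic shuffle so the mixed file is reproducible.
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)
```

Whether a 1:1 mix or a ratio weighted toward the weaker task works better is an empirical question; upsampling the smaller manifest before mixing is a simple way to change the ratio.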
Hi, I'm facing an issue when trying to fine-tune using seamless, with this error:
FileNotFoundError: [Errno 2] No such file or directory: 'm4t_finetune'
Can you explain what I should do with 'm4t_finetune'? I'm using the same script that you used. Thanks in advance.