SpeechT5

pretrain loss

Open MarsMeng1994 opened this issue 2 years ago • 4 comments

Excuse me, what value should my pre-training loss reach before I can start fine-tuning TTS? [image] I found that my fine-tuned TTS model can generate a mel-spectrogram, but it is very different from the original mel-spectrogram. [image] Is this because the BART loss is too high?

MarsMeng1994 · Jul 07 '23 01:07

As mentioned in the SpeechT5 paper: "We pre-train the proposed SpeechT5 model on 32 V100 GPUs with a batch size of around 90s samples per GPU for speech and 12k tokens per GPU for text and set the update frequency to 2 for 500k steps." So keep pre-training. For TTS fine-tuning, the pre-training variant without $\mathcal{L}_{mlm}^s$ is more suitable, because as mentioned in the paper: "The proposed SpeechT5 trained without $\mathcal{L}_{mlm}^s$ is considered because the bidirectional masked prediction loss is proposed to help the encoder learn to encode the speech signal, and this variant achieves superior Naturalness, as shown in Table 13 (in Appendix D)."
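For a rough sense of the scale those numbers imply, here is a back-of-the-envelope calculation in Python; it assumes "update frequency 2" means standard gradient accumulation, which is an assumption on my side rather than something stated explicitly in the paper:

```python
# Rough sketch: effective batch per optimizer update implied by the paper's
# pre-training setup (32 V100 GPUs, ~90 s of speech / 12k text tokens per GPU,
# update frequency 2). "Update frequency" is assumed to mean gradient accumulation.
num_gpus = 32
update_freq = 2            # gradient-accumulation steps per optimizer update (assumed)
speech_sec_per_gpu = 90    # ~90 s of speech per GPU batch
text_tokens_per_gpu = 12_000

speech_per_update = speech_sec_per_gpu * num_gpus * update_freq  # seconds of speech
text_per_update = text_tokens_per_gpu * num_gpus * update_freq   # text tokens

print(f"~{speech_per_update / 60:.0f} minutes of speech per update")  # ~96 min
print(f"~{text_per_update:,} text tokens per update")                 # ~768,000
```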

mechanicalsea · Jul 07 '23 04:07

Thanks for the reply. Does num_updates in the log mean steps? If so, it takes about 2 hours per 100 steps in the picture, so pre-training would take about 10,000 hours? Can I use an English pre-trained model to fine-tune a model for another language? Would that work?
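For reference, the 10,000-hour figure is just an extrapolation of the throughput observed here (treating num_updates as optimizer steps on this single machine):

```python
# Sanity check of the single-GPU extrapolation.
# Numbers come from this thread's observation, not from the paper.
hours_per_100_updates = 2     # ~2 hours per 100 updates observed in the log
total_updates = 500_000       # total pre-training updates from the paper

single_gpu_hours = total_updates / 100 * hours_per_100_updates
print(single_gpu_hours)       # 10000.0 hours at the observed throughput
```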

MarsMeng1994 · Jul 07 '23 07:07

10,000 hours does seem long. Actually, pre-training on the 32 V100 GPUs took around one week, so pre-training on multiple GPUs is recommended. Fine-tuning on other languages is possible by replacing the English vocabulary with the vocabulary of the fine-tuning language, but this causes a language mismatch between pre-training and fine-tuning, which may reduce the benefit of the pre-training method.
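As an illustration of the vocabulary replacement, below is a minimal sketch of building a fairseq dictionary from a plain-text corpus in the target language. The file names and the character-level tokenization are assumptions for illustration, not the exact SpeechT5 recipe:

```python
# Minimal sketch (assumed, not the official SpeechT5 recipe): build a fairseq
# dictionary for the fine-tuning language, which would replace the English
# dictionary used during pre-training.
from collections import Counter
from fairseq.data import Dictionary

corpus_path = "target_language_text.txt"  # hypothetical corpus file
dict_path = "dict.txt"                    # output dictionary file

counter = Counter()
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        # Character-level units here for illustration; SpeechT5's actual
        # text units may differ.
        counter.update(line.strip())

d = Dictionary()
for symbol, count in counter.most_common():
    d.add_symbol(symbol, n=count)
d.save(dict_path)
```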

mechanicalsea · Jul 07 '23 07:07

Thanks for the reply, I will try to use more GPUs. Another question: during pre-training, num_workers is 0. Why not set it to a higher number, as in TTS fine-tuning? [image] Can I set it higher to accelerate pre-training?

When I set num_workers=1, I get an error like: RuntimeError: unable to mmap 408 bytes from file </torch_2632095_3802486040_258611>: Cannot allocate
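This kind of mmap error is typically related to shared memory or file-descriptor limits when PyTorch DataLoader workers pass tensors between processes. A common general workaround is sketched below; it is a generic PyTorch fix, not something SpeechT5 documents, so whether it applies to this exact setup is an assumption:

```python
# Generic PyTorch workaround for "unable to mmap ... Cannot allocate memory"
# raised when DataLoader workers share tensors: share via the file system
# instead of file descriptors. Not SpeechT5-specific; an assumption here.
import torch.multiprocessing as mp

mp.set_sharing_strategy("file_system")
# Alternatively, if running inside Docker, increasing shared memory
# (e.g. `docker run --shm-size=8g ...`) often resolves the same error.
```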

MarsMeng1994 · Jul 12 '23 07:07