About TTS resume

Open arieszhang1994 opened this issue 1 year ago • 3 comments

Hi, I found that the TTS resume code is at https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L140 and https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L302

However, _accelerator_prepare is only called afterwards, at https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L145

So when resume_type == "resume", the self._check_resume function does not seem to work.

Is there something I missed?
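
To make the ordering concern concrete, here is a toy sketch. ToyAccelerator is a hypothetical stand-in written for this illustration, not the real Accelerate API or Amphion code. It assumes that restoring state only affects objects that were already registered by the prepare step, which is the behavior the question above is worried about.

```python
class ToyAccelerator:
    """Hypothetical stand-in for illustration only; not the real Accelerate API."""

    def __init__(self):
        self._registered = []

    def prepare(self, obj):
        # Register the object so that later state restores can reach it.
        self._registered.append(obj)
        return obj

    def load_state(self, checkpoint):
        # Assumption: only objects registered via prepare() are restored.
        for obj in self._registered:
            obj.update(checkpoint)


checkpoint = {"step": 1000}

# Order A: restore first, prepare afterwards (the order described above).
acc = ToyAccelerator()
model_a = {"step": 0}
acc.load_state(checkpoint)
acc.prepare(model_a)

# Order B: prepare first, restore afterwards.
acc = ToyAccelerator()
model_b = {"step": 0}
acc.prepare(model_b)
acc.load_state(checkpoint)

print(model_a["step"], model_b["step"])  # 0 1000
```

Under that assumption, order A silently leaves the model at its initial state, while order B restores it.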

arieszhang1994, Jan 08 '24 07:01

On a separate issue, I am confused by phon_id_collator.get_phone_id_sequence when I run inference with

    sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
        --config ckpts/tts/vits_ljspeech/args.json \
        --infer_expt_dir ckpts/tts/vits_ljspeech/ \
        --infer_output_dir ckpts/tts/vits_ljspeech/result \
        --infer_mode "single" \
        --infer_text "This is a clip of generated speech with the given text from a TTS model."

At https://github.com/open-mmlab/Amphion/blob/main/models/tts/vits/vits_inference.py#L116:

The text is 'This is a clip of generated speech with the given text from a TTS model.'

The phone_seq is ['DH', 'IH0', 'S', 'IH0', 'Z', 'AH0', 'K', 'L', 'IH1', 'P', 'AH0', 'V', 'JH', 'EH1', 'N', 'ER0', 'EY2', 'T', 'AH0', 'D', 'S', 'P', 'IY1', 'CH', 'W', 'IH0', 'DH', 'DH', 'AH0', 'G', 'IH1', 'V', 'AH0', 'N', 'T', 'EH1', 'K', 'S', 'T', 'F', 'ER0', 'M', 'AH0', 'T', 'IY1', 'EH1', 'N', 'IY1', 'S', 'M', 'AA1', 'D', 'AH0', 'L']

However, the phone_id_seq is [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56, 11, 46, 45, 63, 42, 55, 52, 11, 63, 11, 38, 45, 63, 42, 55, 52, 11, 48, 11, 49, 11, 46, 45, 52, 51, 42, 11, 53, 11, 38, 45, 63, 42, 55, 52, 11, 59, 11, 47, 45, 11, 42, 45, 52, 51, 42, 11, 51, 11, 42, 55, 63, 42, 55, 52, 11, 42, 62, 57, 60, 52, 11, 57, 11, 38, 45, 63, 42, 55, 52, 11, 41, 11, 56, 11, 53, 11, 46, 62, 52, 51, 42, 11, 40, 45, 11, 60, 11, 46, 45, 63, 42, 55, 52, 11, 41, 45, 11, 41, 45, 11, 38, 45, 63, 42, 55, 52, 11, 44, 11, 46, 45, 52, 51, 42, 11, 59, 11, 38, 45, 63, 42, 55, 52, 11, 51, 11, 57, 11, 42, 45, 52, 51, 42, 11, 48, 11, 56, 11, 57, 11, 43, 11, 42, 55, 63, 42, 55, 52, 11, 50, 11, 38, 45, 63, 42, 55, 52, 11, 57, 11, 46, 62, 52, 51, 42, 11, 42, 45, 52, 51, 42, 11, 51, 11, 46, 62, 52, 51, 42, 11, 56, 11, 50, 11, 38, 38, 52, 51, 42, 11, 41, 11, 38, 45, 63, 42, 55, 52, 11, 49]

When I run text.sequence_to_text(phone_id_seq), the result is: dh ihzero s ihzero z ahzero k l ihone p ahzero v jh ehone n erzero eytwo t ahzero d s p iyone ch w ihzero dh dh ahzero g ihone v ahzero n t ehone k s t f erzero m ahzero t iyone ehone n iyone s m aaone d ahzero l

Does Amphion do this on purpose?

arieszhang1994, Jan 12 '24 14:01

Thanks for your feedback. Please check PR #108.

lmxue, Jan 16 '24 11:01

Thank you! Also, could you look at the second issue I mentioned? I tried adding

    phones = " ".join(phone_seq)
    phones = "{" + phones + "}"
    phone_seq = phones.split(" ")

after this line: https://github.com/open-mmlab/Amphion/blob/a840088a9cc1d5c3afa3ed2f6c39db35c32d1f65/models/tts/vits/vits_dataset.py#L80 and retrained a new VITS model.

I also made the same change in the inference code. However, the retrained model fails to synthesize intelligible English, although the loss looks normal (it dropped to 37). The generated demo sounds as if the phone embeddings haven't been trained. It's strange: I have been debugging for several days and still can't find the cause.
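
For reference, a quick check of what the brace-wrapping snippet produces, assuming the intended second line is phones = "{" + phones + "}". The three-phone input is a hypothetical example chosen only to keep the output short.

```python
# Hypothetical short phone sequence, used only for illustration.
phone_seq = ["DH", "IH0", "S"]

phones = " ".join(phone_seq)    # "DH IH0 S"
phones = "{" + phones + "}"     # "{DH IH0 S}"
phone_seq = phones.split(" ")   # the braces stay attached to the first and last tokens

print(phone_seq)  # ['{DH', 'IH0', 'S}']
```

Whether this matters depends on how phone_seq is consumed afterwards. If a later step looks tokens up one at a time, '{DH' and 'S}' may not match any entry in the symbol set.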

arieszhang1994, Jan 16 '24 11:01

When cfg.preprocess.phone_extractor == "lexicon", we convert text to a phone sequence using the dictionary defined in https://raw.githubusercontent.com/open-mmlab/Amphion/main/text/lexicon/librispeech-lexicon.txt. For the conversion from phone sequence to phone ID sequence, we currently use the phoneme set from https://github.com/HarryHe11/vc-dev/blob/main/text/symbols.py. However, it should use the phoneme set from librispeech-lexicon.txt. I'll refactor this part. Thanks for your feedback.
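
The symptom reported earlier in the thread can be reproduced with a small sketch. The symbol table below is an assumption (pad, hyphen, punctuation, A-Z, a-z, in that order); it is not copied from Amphion or from symbols.py, and was chosen only because it reproduces the reported IDs. The helper names chars_to_ids and ids_to_chars are hypothetical.

```python
import string

# Assumed symbol layout for illustration; NOT taken from Amphion's code.
PAD = "_"
SPECIAL = "-"
PUNCTUATION = "!'(),.:;? "
SYMBOLS = (
    [PAD]
    + list(SPECIAL)
    + list(PUNCTUATION)
    + list(string.ascii_uppercase)
    + list(string.ascii_lowercase)
)
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}
ID_TO_SYMBOL = {i: s for s, i in SYMBOL_TO_ID.items()}


def chars_to_ids(text):
    """Map a string to IDs one character at a time."""
    return [SYMBOL_TO_ID[c] for c in text]


def ids_to_chars(ids):
    """Inverse of chars_to_ids."""
    return "".join(ID_TO_SYMBOL[i] for i in ids)


# First IDs of the phone_id_seq reported in this thread.
reported_prefix = [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56, 11]
print(repr(ids_to_chars(reported_prefix)))  # 'dh ihzero s '
```

Under this assumed table, the reported IDs are consistent with a character-level encoding of lowercased text with the stress digits spelled out, so a single phone such as IH0 becomes six IDs ("ihzero") rather than one phone ID.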

lmxue, Feb 15 '24 12:02

Hi @arieszhang1994, if you have any further questions about TTS resume, feel free to re-open this issue. We are glad to follow up!

HarryHe11, Feb 16 '24 09:02