Amphion
About TTS resume
Hi, I found that the TTS resume code is at https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L140 and https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L302
However, _accelerator_prepare is called at https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L145
So when resume_type == "resume", the self._check_resume function does not seem to work.
Is there something I missed?
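The ordering concern can be illustrated with a toy sketch. ToyAccelerator below is a hypothetical stand-in, not the real accelerate API and not Amphion's trainer; it only assumes that loading a checkpoint restores state into objects that were registered by a prepare step, so loading before preparing silently restores nothing.

```python
class ToyAccelerator:
    """Hypothetical stand-in: load_state only touches objects registered by prepare()."""

    def __init__(self):
        self._registered = []

    def prepare(self, obj):
        # Register the object so that a later load_state can restore into it.
        self._registered.append(obj)
        return obj

    def load_state(self, saved_states):
        # Restore saved state into registered objects, in registration order.
        for obj, state in zip(self._registered, saved_states):
            obj.update(state)


def resume(load_before_prepare):
    model = {"step": 0}             # toy "model" whose state is a single counter
    checkpoint = [{"step": 1000}]   # toy checkpoint contents
    accelerator = ToyAccelerator()
    if load_before_prepare:
        accelerator.load_state(checkpoint)  # nothing registered yet, so a no-op
        model = accelerator.prepare(model)
    else:
        model = accelerator.prepare(model)
        accelerator.load_state(checkpoint)  # restores into the prepared model
    return model["step"]


print(resume(load_before_prepare=True))   # 0
print(resume(load_before_prepare=False))  # 1000
```

In this toy setup, only the prepare-then-load order ends up with the checkpointed step.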
As a separate issue, I am confused by phon_id_collator.get_phone_id_sequence.
When I run the inference process with
sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
    --config ckpts/tts/vits_ljspeech/args.json \
    --infer_expt_dir ckpts/tts/vits_ljspeech/ \
    --infer_output_dir ckpts/tts/vits_ljspeech/result \
    --infer_mode "single" \
    --infer_text "This is a clip of generated speech with the given text from a TTS model."
at https://github.com/open-mmlab/Amphion/blob/main/models/tts/vits/vits_inference.py#L116
the text
is 'This is a clip of generated speech with the given text from a TTS model.'
and the phone_seq
is '['DH', 'IH0', 'S', 'IH0', 'Z', 'AH0', 'K', 'L', 'IH1', 'P', 'AH0', 'V', 'JH', 'EH1', 'N', 'ER0', 'EY2', 'T', 'AH0', 'D', 'S', 'P', 'IY1', 'CH', 'W', 'IH0', 'DH', 'DH', 'AH0', 'G', 'IH1', 'V', 'AH0', 'N', 'T', 'EH1', 'K', 'S', 'T', 'F', 'ER0', 'M', 'AH0', 'T', 'IY1', 'EH1', 'N', 'IY1', 'S', 'M', 'AA1', 'D', 'AH0', 'L']'
However, the phone_id_seq
is [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56, 11, 46, 45, 63, 42, 55, 52, 11, 63, 11, 38, 45, 63, 42, 55, 52, 11, 48, 11, 49, 11, 46, 45, 52, 51, 42, 11, 53, 11, 38, 45, 63, 42, 55, 52, 11, 59, 11, 47, 45, 11, 42, 45, 52, 51, 42, 11, 51, 11, 42, 55, 63, 42, 55, 52, 11, 42, 62, 57, 60, 52, 11, 57, 11, 38, 45, 63, 42, 55, 52, 11, 41, 11, 56, 11, 53, 11, 46, 62, 52, 51, 42, 11, 40, 45, 11, 60, 11, 46, 45, 63, 42, 55, 52, 11, 41, 45, 11, 41, 45, 11, 38, 45, 63, 42, 55, 52, 11, 44, 11, 46, 45, 52, 51, 42, 11, 59, 11, 38, 45, 63, 42, 55, 52, 11, 51, 11, 57, 11, 42, 45, 52, 51, 42, 11, 48, 11, 56, 11, 57, 11, 43, 11, 42, 55, 63, 42, 55, 52, 11, 50, 11, 38, 45, 63, 42, 55, 52, 11, 57, 11, 46, 62, 52, 51, 42, 11, 42, 45, 52, 51, 42, 11, 51, 11, 46, 62, 52, 51, 42, 11, 56, 11, 50, 11, 38, 38, 52, 51, 42, 11, 41, 11, 38, 45, 63, 42, 55, 52, 11, 49]
When I run
text.sequence_to_text(phone_id_seq)
the result is
dh ihzero s ihzero z ahzero k l ihone p ahzero v jh ehone n erzero eytwo t ahzero d s p iyone ch w ihzero dh dh ahzero g ihone v ahzero n t ehone k s t f erzero m ahzero t iyone ehone n iyone s m aaone d ahzero l
Does Amphion do this on purpose?
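To make the observation concrete, here is a minimal sketch that reproduces the first few IDs above. The symbol layout (pad, "-", punctuation, then upper- and lower-case letters) and the digit-expanding cleaner are assumptions inferred from the IDs and the decoded text in this post; they are not copied from Amphion's code.

```python
import re

# Assumed symbol table, inferred from the IDs in this post (space = 11, 'a' = 38).
PAD = "_"
SPECIAL = "-"
PUNCTUATION = "!'(),.:;? "
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
SYMBOLS = [PAD] + list(SPECIAL) + list(PUNCTUATION) + list(LETTERS)
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

# Assumed cleaner behaviour: stress digits are spelled out, then text is lowercased.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two"}


def clean(text):
    text = re.sub(r"[0-2]", lambda m: DIGIT_WORDS[m.group()], text)
    return text.lower()


def phones_to_char_ids(phone_seq):
    # The phone list is treated as ordinary text and encoded one character at a time.
    return [SYMBOL_TO_ID[c] for c in clean(" ".join(phone_seq))]


def ids_to_text(ids):
    return "".join(SYMBOLS[i] for i in ids)


ids = phones_to_char_ids(["DH", "IH0", "S"])
print(ids)               # [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56]
print(ids_to_text(ids))  # dh ihzero s
```

Under these assumptions, each phone such as IH0 is encoded as the characters of "ihzero" rather than as a single phoneme ID, which matches the start of the phone_id_seq shown above.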
Thanks for your feedback. Please check PR #108.
Thank you!
Also, could you check the second issue I mentioned? I tried to add
phones = " ".join(phone_seq)
phones = "{" + phones + "}"
phone_seq = phones.split(" ")
after this line:
https://github.com/open-mmlab/Amphion/blob/a840088a9cc1d5c3afa3ed2f6c39db35c32d1f65/models/tts/vits/vits_dataset.py#L80
and retrained a new VITS model.
I also made the same change in the inference code. However, the retrained model fails to synthesize intelligible English, although the loss seems normal (it dropped to 37). The generated demo sounds as if the phone embedding hasn't been trained. It's so weird: I have been debugging for several days and still can't find the reason.
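For reference, this is what the three added lines produce on a shortened phone list. It is a minimal sketch: the helper name wrap_in_braces is made up for illustration, and the concatenation is written with the missing "+" filled in, which is assumed to be what was intended.

```python
def wrap_in_braces(phone_seq):
    # Same three steps as the snippet above, applied to a list of phones.
    phones = " ".join(phone_seq)
    phones = "{" + phones + "}"
    return phones.split(" ")


print(wrap_in_braces(["DH", "IH0", "S"]))  # ['{DH', 'IH0', 'S}']
```

Note that after the final split, the braces stay attached to the first and last tokens. Whether the downstream ID lookup expects tokens in that form is not something this snippet can confirm.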
When cfg.preprocess.phone_extractor == "lexicon", we convert text to a phone sequence based on the dictionary defined in https://raw.githubusercontent.com/open-mmlab/Amphion/main/text/lexicon/librispeech-lexicon.txt.
For the conversion from phone sequence to phone ID sequence, we currently use the phoneme set from https://github.com/HarryHe11/vc-dev/blob/main/text/symbols.py. However, it should use the phoneme set from librispeech-lexicon.txt. I'll refactor this part. Thanks for your feedback.
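As a rough illustration of the direction described above (not the actual refactor), a phone-to-ID table could be derived from the lexicon's own phone inventory. The sketch assumes each lexicon line has the form "WORD PH1 PH2 ..."; the function names and the sample lines are made up for the example.

```python
def build_phone_vocab(lexicon_lines):
    """Collect every phone that appears in the lexicon and assign it a stable ID."""
    phones = set()
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phones.update(parts[1:])  # parts[0] is the word, the rest are phones
    return {phone: idx for idx, phone in enumerate(sorted(phones))}


def phones_to_ids(phone_seq, vocab):
    # One ID per phone, so 'IH1' stays a single token instead of being spelled out.
    return [vocab[p] for p in phone_seq]


# Made-up sample lines in the assumed "WORD PH1 PH2 ..." format.
sample_lexicon = ["THIS DH IH1 S", "IS IH1 Z", "A AH0"]
vocab = build_phone_vocab(sample_lexicon)
print(vocab)                                     # {'AH0': 0, 'DH': 1, 'IH1': 2, 'S': 3, 'Z': 4}
print(phones_to_ids(["DH", "IH1", "S"], vocab))  # [1, 2, 3]
```

With a table like this, the ID sequence has the same length as the phone sequence, unlike the character-level IDs reported in the question.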
Hi @arieszhang1994, if you have any further questions about the TTS resume, feel free to re-open this issue. We are glad to follow up!