vall-e
Has anyone tried this repo on other languages and gotten good performance?
Has anyone tried this repo on other languages and gotten good performance? 50 hours of toy data didn't seem to produce intelligible speech.
Hi, @MisakaMikoto96 Sorry, but did you manage to get good results with English? I cannot generate audio using a model trained on some data; the result is just noise, not voice. Looking forward to your reply. Regards! Petar
I only tried it on a tiny 1-hour Mandarin dataset, and I set the input prompt to be the utterance itself (i.e., not using self.sample_prompts in data.py). I got a human voice because the model overfit my dataset (tested by feeding in a transcription and its matching audio; the model was able to reconstruct the audio). @enhuiz's reproduction seems somewhat different from the paper. May I ask why sample_prompts is in the data processing and only selects the qnt, not the <phn, qnt> pair?
Also, at the inference stage, the paper prefers to input "text_prompt" + "text_to_be_gen" + "audio_prompt"; is there any explanation of this in your code?
Many thanks for your work!
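For readers following the thread, here is a minimal sketch of the behavior being discussed, assuming sample_prompts draws a same-speaker quantized clip as the acoustic prompt; the signature, qnt_paths_by_speaker, and max_frames are illustrative assumptions, not the repo's actual code:

```python
# Rough paraphrase (an assumption, not the repo's exact code) of what
# sample_prompts in data.py appears to do: draw an acoustic prompt as
# quantized codes (qnt) from another utterance of the same speaker,
# without pairing it with that utterance's phonemes (phn).
import random
from pathlib import Path

import torch


def sample_prompts(qnt_paths_by_speaker: dict[str, list[Path]],
                   speaker: str, exclude: Path,
                   max_frames: int = 225) -> torch.Tensor:
    """Pick a random same-speaker qnt clip to use as the acoustic prompt."""
    candidates = [p for p in qnt_paths_by_speaker[speaker] if p != exclude]
    path = random.choice(candidates) if candidates else exclude
    qnt = torch.load(path)  # shape: (T, n_quantizer_levels)
    # Crop a random window so the prompt has bounded length.
    if qnt.shape[0] > max_frames:
        start = random.randint(0, qnt.shape[0] - max_frames)
        qnt = qnt[start:start + max_frames]
    return qnt  # only qnt is returned; no <phn, qnt> pair
```

If the sampling really works this way, the prompt's transcription never enters the model, which is exactly the <phn, qnt> question raised above.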
Hi, @MisakaMikoto96 Thanks for your message. Sorry, but would you mind sharing your code (data.py, config.py, ar.yml)? Looking forward to hearing from you. Best Regards! Petar
I have the same confusion. But I found that this implementation lets us infer from the target text alone, rather than concatenating the prefix text + target text as described in the paper.

> Also, at the inference stage, the paper prefers to input "text_prompt" + "text_to_be_gen" + "audio_prompt"; is there any explanation of this in your code?

This can lead to a problem when your acoustic prompt is not consistent with the prefix text.
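To make the mismatch concrete, here is a minimal sketch of the two inference setups; the ar_model.generate interface is hypothetical, and only the ordering of the inputs reflects the paper versus this repo:

```python
# Hedged sketch of the two inference setups; ar_model.generate is a
# hypothetical interface, and only the input ordering reflects the
# paper versus this repo.
import torch


def infer_paper_style(ar_model, prefix_phn: torch.Tensor,
                      target_phn: torch.Tensor,
                      prompt_qnt: torch.Tensor) -> torch.Tensor:
    """VALL-E paper: phonemes = prefix text + target text.

    prompt_qnt must be the audio *of* prefix_phn, so the text and
    acoustic prompts stay consistent by construction.
    """
    phn = torch.cat([prefix_phn, target_phn])
    return ar_model.generate(phn, prompt_qnt)


def infer_repo_style(ar_model, target_phn: torch.Tensor,
                     prompt_qnt: torch.Tensor) -> torch.Tensor:
    """This repo: phonemes = target text only, plus an acoustic prompt.

    The prompt's transcription is never given to the model, so a
    mismatch between the prompt audio and any implied prefix text goes
    undetected, which is the inconsistency pointed out above.
    """
    return ar_model.generate(target_phn, prompt_qnt)
```

In the paper-style call, the acoustic prompt is the audio of the prefix text, so text and audio stay aligned; in the repo-style call there is no prefix text, so nothing enforces that consistency.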