Bugfix: Delay pattern mask is applied twice
There are a few bugfixes / contributions:
-
(Bugfix) In
ParlerTTSForConditionalGeneration::generate(), the delay patter mask is built and applied toinput_idsbefore calling_sample(). However, we do not want to apply the mask until we're inside the_sample()function. The above bug doesn't affect the current inference setup because theinput_idsreturned happens to be the same as what's passed in. -
(Contribution) I've provided an example of how to do audio enrolment to improve the consistency of audio generation in (
helpers/voice_enrolment/enrol.ipynb). I believe it helps, but I'm not sure if it's always better. Nonetheless, I mainly provided this as an example to demonstrate how the bugfix helps: we can try to provide the enrolment as prefix tokens indecoder_input_idsargs when callingmodel.generate(). You should notice that without the bugfix, the audio sounds "crackly", which is because the mask has effectively been applied twice on the prefix. -
(Bugfix) When performing deterministic greedy decoding (by passing in
do_sample=False, there is bug where thelogits_warperis not passed in. I believe this should just beNone(?), which I've commited in this PR. Related to this, I also want to raise an issue that deterministic sampling by settingdo_sample=Falseortemperatute=0.1tends to generate random noise.
Perhaps the bugfixes also need to be applied in ParlerTTSForCausalLM? (I haven't touched this class so I'm not sure about it's intended use)
@ylacombe Would it be possible for you to review this?
Hey @Guppy16, thanks for opening this ! I'll take a look in the coming days!
(Bugfix) In ParlerTTSForConditionalGeneration::generate(), the delay patter mask is built and applied to input_ids before calling _sample(). However, we do not want to apply the mask until we're inside the _sample() function. The above bug doesn't affect the current inference setup because the input_ids returned happens to be the same as what's passed in.
So it's actually a redundant operation that changes nothing, right ? Just want to make sure it's not a bug. When I experiment, it seems that it doesn't change anything
Thanks a lot for reviewing this, as well as your great suggestions! I'll work on this in the coming few days.
- I'd rather have an issue opened that explain how to do your voice enrolment thing than a notebook ! Would you like to write a guide about this and add it to the issues ? I can ping it afterwards
Looks like @apresence has made a start on this! I've added a modified version of ur snippet there (https://github.com/huggingface/parler-tts/issues/139)
Closing because https://github.com/huggingface/parler-tts/pull/141 addressed it! Thanks for the work here!