parler-tts Bugfix: Delay pattern mask is applied twice

There are a few bugfixes / contributions:

(Bugfix) In ParlerTTSForConditionalGeneration::generate(), the delay patter mask is built and applied to input_ids before calling _sample(). However, we do not want to apply the mask until we're inside the _sample() function. The above bug doesn't affect the current inference setup because the input_ids returned happens to be the same as what's passed in.
(Contribution) I've provided an example of how to do audio enrolment to improve the consistency of audio generation in (helpers/voice_enrolment/enrol.ipynb ). I believe it helps, but I'm not sure if it's always better. Nonetheless, I mainly provided this as an example to demonstrate how the bugfix helps: we can try to provide the enrolment as prefix tokens in decoder_input_ids args when calling model.generate(). You should notice that without the bugfix, the audio sounds "crackly", which is because the mask has effectively been applied twice on the prefix.
(Bugfix) When performing deterministic greedy decoding (by passing in do_sample=False, there is bug where the logits_warper is not passed in. I believe this should just be None(?), which I've commited in this PR. Related to this, I also want to raise an issue that deterministic sampling by setting do_sample=False or temperatute=0.1 tends to generate random noise.

Aug 16 '24 16:08 Guppy16

Perhaps the bugfixes also need to be applied in ParlerTTSForCausalLM? (I haven't touched this class so I'm not sure about it's intended use)

Aug 19 '24 09:08 Guppy16

@ylacombe Would it be possible for you to review this?

Aug 21 '24 17:08 Guppy16

Hey @Guppy16, thanks for opening this ! I'll take a look in the coming days!

Sep 02 '24 11:09 ylacombe

(Bugfix) In ParlerTTSForConditionalGeneration::generate(), the delay patter mask is built and applied to input_ids before calling _sample(). However, we do not want to apply the mask until we're inside the _sample() function. The above bug doesn't affect the current inference setup because the input_ids returned happens to be the same as what's passed in.

So it's actually a redundant operation that changes nothing, right ? Just want to make sure it's not a bug. When I experiment, it seems that it doesn't change anything

Sep 05 '24 09:09 ylacombe

Thanks a lot for reviewing this, as well as your great suggestions! I'll work on this in the coming few days.

Sep 25 '24 18:09 Guppy16

I'd rather have an issue opened that explain how to do your voice enrolment thing than a notebook ! Would you like to write a guide about this and add it to the issues ? I can ping it afterwards

Looks like @apresence has made a start on this! I've added a modified version of ur snippet there (https://github.com/huggingface/parler-tts/issues/139)

Sep 27 '24 13:09 Guppy16

Closing because https://github.com/huggingface/parler-tts/pull/141 addressed it! Thanks for the work here!

Oct 14 '24 12:10 ylacombe