"last one means y is already too long, shouldn't happen, but put it here"

Open royrs opened this issue 7 months ago • 0 comments

While working with the model, I ran into cases where the targeted text was removed but nothing was generated in its place. After some investigation, I found the following line in your code: https://github.com/jasonppy/VoiceCraft/blob/a702dfd2ced6d4fd6b04bdc160c832c6efc8f6c5/models/voicecraft.py#L752 which checks if y_input > 10 * x_lens and, if so, generates nothing.

Why do we need this check? I'm not sure why the target transcript length and the input size should limit generation. In the code comment you wrote that it shouldn't happen, but it can happen when the audio contains only a few words yet is long because of the silences in it.
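
To illustrate the concern, here is a minimal sketch of the check as I understand it, not the actual repository code. The function name, the 50 Hz frame rate, and the token counts are all assumptions made up for this example; only the factor of 10 comes from the line linked above.

```python
def hits_length_cap(num_audio_frames: int, num_text_tokens: int, ratio: int = 10) -> bool:
    """Return True when the audio sequence is longer than ratio * text length.

    Hypothetical stand-in for the comparison y_input > 10 * x_lens.
    """
    return num_audio_frames > ratio * num_text_tokens


FRAME_RATE_HZ = 50  # assumed codec frame rate, not read from the repo

# Hypothetical 5-second clip: 250 audio frames at the assumed frame rate.
audio_frames = 5 * FRAME_RATE_HZ

# Few words plus long silences: short transcript, cap = 10 * 20 = 200 frames.
sparse_transcript_tokens = 20
# Same duration but densely spoken: cap = 10 * 60 = 600 frames.
dense_transcript_tokens = 60

sparse_hit = hits_length_cap(audio_frames, sparse_transcript_tokens)
dense_hit = hits_length_cap(audio_frames, dense_transcript_tokens)
print(sparse_hit, dense_hit)
```

Under these assumed numbers, the sparse clip exceeds the cap before any new audio is generated, while a densely spoken clip of the same duration does not.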

All the audio clips I tested are 4 to 5 seconds long, the length you suggest works best.

I tried removing this check, and for the few examples I tested it gave good results.

May 07 '25 07:05 royrs