VoiceCraft
Some voice editing problems
I have noticed some issues with voice editing in tests and demos that I would like to ask about. When the last part of the text is edited, for example in https://youtu.be/PJ2qSjycLcw?t=353 (starting around 5:50), the synthesis quality degrades at the end of the sentence. I suspect the bad audio comes not from masking two parts but from editing the end of the sentence. I also found that your demo "this was george steers the son of a british naval captain and ship modeler who had become an american naval officer and was entrusted with the prestigious role of overseeing the operations at the renowned naval headquarters" edits the end of the sentence, and it too has strange pauses between the last few words.
Thanks! I'm not sure I understand your question. If you meant to ask how to reduce unnatural pauses in the generation, try reducing the stop_repetition param to 1 or 2, or simply generate a few samples and select the shortest one (the speech editing code doesn't support parallel decoding at this point).
In general, VoiceCraft is trained on GigaSpeech, where most utterances are 4~5 seconds long, although the longest ones are ~20 sec. So I would expect the model to perform worse on long generations. This problem would likely go away if you train/finetune the model on datasets of long sentences.
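The "generate a few samples and select the shortest one" suggestion can be sketched as a small loop. This is only an illustration: `pick_shortest` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for one run of the speech editing inference (it is not VoiceCraft's API). Since parallel decoding isn't supported for editing, the samples are produced one at a time with different seeds.

```python
import random
from typing import Callable, List


def pick_shortest(generate_fn: Callable[[int], List[float]],
                  num_samples: int = 5,
                  base_seed: int = 0) -> List[float]:
    """Run generate_fn once per seed and return the shortest result."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    # Sequential runs, one seed each (no batched/parallel decoding assumed).
    candidates = [generate_fn(base_seed + i) for i in range(num_samples)]
    # Heuristic from the reply above: the shortest sample is the one
    # least likely to contain long unnatural pauses.
    return min(candidates, key=len)


def fake_generate(seed: int) -> List[float]:
    """Hypothetical stand-in for one editing run: silence of random length."""
    rng = random.Random(seed)
    return [0.0] * rng.randint(10, 50)


best = pick_shortest(fake_generate, num_samples=5)
print(len(best))
```

In practice you would replace `fake_generate` with a function that seeds the model, runs the editing inference once, and returns the generated waveform.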
Yes, thanks for your answer. It may indeed be related to long sentences: both the YouTube example I gave and your demo involve long sentences. However, the unnatural pauses fall between words at the end of the sentence. The fix you suggest at inference time is to set the stop_repetition param to 1 or 2. For a permanent fix, would it be better to train on long story sentences such as those in LibriSpeech?
Yeah, the average utterance length in GigaSpeech is 5 sec (even though many utterances in GigaSpeech are >15 sec), so there is a dataset bias. Finetuning the model on long-form utterances could help.
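If you go the finetuning route, one way to counter the length bias is to keep only the longer utterances when building the finetuning set. A minimal sketch, assuming a manifest that is a list of dicts with a `duration` field in seconds; this format, the function name `select_long_utterances`, and the 10-second threshold are all assumptions for illustration, not VoiceCraft's actual data format.

```python
from typing import Dict, List


def select_long_utterances(manifest: List[Dict],
                           min_seconds: float = 10.0) -> List[Dict]:
    """Keep only utterances whose duration is at least min_seconds."""
    return [utt for utt in manifest if utt["duration"] >= min_seconds]


# Hypothetical manifest entries; durations are in seconds.
manifest = [
    {"id": "utt1", "duration": 4.2},
    {"id": "utt2", "duration": 12.7},
    {"id": "utt3", "duration": 18.9},
    {"id": "utt4", "duration": 5.1},
]

long_subset = select_long_utterances(manifest)
print([utt["id"] for utt in long_subset])
```

The threshold is a knob to tune against how much long-form data you have left after filtering.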