chatterbox
Gibberish and hallucinations with short segments
Hi, short segments like "Hi!", "Why?", "Yes", "No", or single letters or numbers tend to produce gibberish or hallucinations, and sometimes even longer segments do.
I’ve already tried playing around with the cfg, exaggeration, and temperature parameters, as well as trying to find a fixed seed that works, but to no avail.
Is there any way to force the model to produce reliable results for short segments?
I've been getting this as well. I'm working on an audiobook generator, and sometimes when a character's line is very short, I get this. I've tried it both on GPU with CUDA and on CPU to see if that was the issue.
@psdwizzard did you figure out what could cause these noises? If not, I figure it could be the model. Perhaps more fine-tuning is needed?
I haven't personally figured it out. My working theory is that there just aren't a lot of very short clips in the training data. But it also could be the way something's implemented. I know for a while there I was getting weird CUDA issues, and we were able to figure those out. But I don't know if this is something along those lines or on the training-data side.
Oh, so by fixing your CUDA issues, you managed to produce audio that doesn't have any artifacts?
I’ve been playing around with a few open source TTS implementations (F5, Zonos, fish, edge), and they pretty much all tend to fall apart when the input text is too short or somehow malformed (like not ending on a full stop). So this seems to be a general problem. The only one that has worked a bit better for me in this regard is Kokoro.
Anyone know any good workflows for removing distortions in general? Currently I'm planning on using Whisper to validate the audio, but I'm wondering if there's a better way.
My fork does just this. It uses Whisper Sync to validate the audio. And that's not all; here's what I've got going on.
To avoid audio problems with short sentences, any sentence below 20 characters gets merged into the sentence before or after it. To avoid artifacts and hallucinations, my fork generates multiple candidates per chunk (user-specified, default 3). After all the audio chunks are generated, it transcribes each one to make sure it matches its input text, then picks the shortest candidate of each chunk that passed the Whisper Sync validation. The idea here is that artifacts and hallucinations generally make the sample longer in duration, so picking the shortest sample lowers the probability of having them in your output.

If a candidate fails the Whisper Sync validation, the fork retries generating that candidate, up to 3 times by default (also user-specified). If all candidates fail, it picks either the candidate with the highest Whisper Sync fuzzy-matching score or the one with the most characters, based on a choice the user makes before generation.

To get around quiet hallucinations, like low extended breathing sounds, I use auto-editor. Auto-editor sets a volume threshold below which everything is removed, along with a "margin" that leaves a buffer of time before and after each cut.
The user can also choose to bypass the Whisper Sync check entirely, for faster audio generation.
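If anyone wants to try the approach without pulling the whole fork, the core of the validation loop looks something like this. This is a simplified sketch, not the exact code from my fork: `synthesize` is a stand-in for the actual Chatterbox call, and the file naming and 0.95 match threshold are just illustrative. The only real APIs used are openai-whisper's `load_model`/`transcribe`, plus `difflib` and `wave` from the standard library.

```python
import difflib
import wave

import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def normalize(text: str) -> str:
    # Lowercase and drop punctuation so the transcript compares fairly.
    return " ".join("".join(c for c in text.lower() if c.isalnum() or c.isspace()).split())

def wav_seconds(path: str) -> float:
    # Duration of a WAV file in seconds (stdlib only).
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def fuzzy_score(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_take(text, synthesize, n_candidates=3, n_retries=3, threshold=0.95):
    # Generate candidates, validate each against its own transcript, and
    # keep the shortest one that passes. Artifacts and hallucinations tend
    # to make the audio LONGER, so shortest-passing is the safest pick.
    passed, failed = [], []
    for i in range(n_candidates):
        for attempt in range(n_retries):
            path = f"cand_{i}_{attempt}.wav"  # illustrative file naming
            synthesize(text, path)            # stand-in for the TTS call
            score = fuzzy_score(text, asr.transcribe(path)["text"])
            if score >= threshold:
                passed.append((wav_seconds(path), path))
                break
            failed.append((score, path))
    if passed:
        return min(passed)[1]  # shortest validated candidate
    return max(failed)[1]      # fallback: highest fuzzy-match score
```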
I developed my fork for the purpose of generating audio books with my voice for one of my kids.
Is anyone facing the sentence-repeating problem in the output? Sometimes it just repeats the same sentence two or three times.
That was exactly why I came to GitHub just now. A majority of the paragraphs I ran went fine until the last sentence or so, where it would repeat a few words.
I am able to solve this by keeping the chunk size at 200 characters, with each chunk ending at a full stop (.).
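Roughly, the splitting I do looks something like this (a simplified sketch, not my exact code; the function name and thresholds are illustrative). It also folds in the under-20-character merge mentioned earlier in this thread, since very short chunks are what trigger the gibberish in the first place:

```python
import re

def chunk_text(text: str, max_len: int = 200, min_len: int = 20) -> list[str]:
    # Split on sentence-ending punctuation, then greedily pack sentences
    # into chunks of at most max_len characters so every chunk still ends
    # on a full stop. A single sentence longer than max_len becomes its
    # own (oversized) chunk.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for s in sentences:
        if chunks and len(chunks[-1]) + 1 + len(s) <= max_len:
            chunks[-1] += " " + s
        else:
            chunks.append(s)
    # Merge any chunk still under min_len into its predecessor.
    merged = []
    for c in chunks:
        if merged and len(c) < min_len:
            merged[-1] += " " + c
        else:
            merged.append(c)
    return merged
```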
First, I want to say thanks. This is a great model and I'm really enjoying using it in the fork I'm working on right now. I'm wondering if you've made any progress on figuring out what the issue is with the really short generations. I know a lot of models, like you mentioned, do struggle with this, but XTTS2 seems to handle very short generations, even a single word, very well. I know they're fundamentally different types of models, but maybe we can look at why that one handles it and this one doesn't, and maybe fix the issue.
I have tested petermg's fork and it works great. @psdwizzard do you know when it will be integrated into your audiobook fork?
I'll be honest, I've been kind of busy with other projects lately and I hadn't seen this, so I appreciate the tag. I'll check it out tomorrow.