fish-speech The Minion randomly intruded into my audio

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

NVIDIA RTX 3090, Python 3.10, torch==2.4.1, torchvision==0.19.1, torchaudio==2.4.1

Steps to Reproduce

/root/miniconda3/bin/python -m tools.api_server --listen 0.0.0.0:6006 --llama-checkpoint-path "/usr/github/fish-speech/checkpoints/fish-speech-1.5" --decoder-checkpoint-path "/usr/github/fish-speech/checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" --decoder-config-name firefly_gan_vq --compile

✔️ Expected Behavior

No response

❌ Actual Behavior

Please listen to the last few seconds of this audio, where a Minion's voice appears: https://saysay-bucket1.s3.us-west-1.amazonaws.com/uploads/default/20241224/4f75658f38eb0c163acced94328a73b6e78275bb.mp3

Text: By the end of this century, we will have reached a technological singularity, where quantum computing leads to a paradigm shift in epistemology.

The problem is intermittent: 13 of my 100 generated audio files show similar artifacts. How can I solve this?

Dec 25 '24 10:12 Haoran1272

Listen to this one as well: https://saysay-bucket1.s3.us-west-1.amazonaws.com/uploads/default/20250102/6f025e260633d038db74c5fd218098975e505a77.mp3

Jan 02 '25 09:01 Haoran1272

This happens all the time for me. Generate a few and choose the median length one.
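
A minimal sketch of that workaround, assuming each candidate is available as in-memory WAV bytes (the helper names `wav_duration_seconds` and `pick_median_length` are hypothetical and not part of fish-speech). It only does the selection step; producing the candidates is left to however you already call the server.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of a PCM WAV clip held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def pick_median_length(candidates: list[bytes]) -> bytes:
    """Pick the candidate whose duration is the median of the batch.

    Assumption: a take with extra sounds at the end tends to run longer
    than a clean take, so the median-length take is less likely to be bad.
    For an even number of candidates the shorter of the two middle takes
    is returned.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    ranked = sorted(candidates, key=wav_duration_seconds)
    return ranked[(len(ranked) - 1) // 2]
```

Generating an odd number of takes (three or five) gives an unambiguous median. This does not detect the artifact itself, so a batch where most takes are affected can still return a bad one.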

Jan 07 '25 02:01 mashdragon

In my case, I need to preprocess the text to remove *, for example *args -> args. That way some of the erroneous sounds are not generated, but there may be other cases that need preprocessing that I haven't found yet.
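
A minimal sketch of that preprocessing step, assuming the text is cleaned before it is sent for synthesis. The function name `strip_unspoken_symbols` and its `symbols` parameter are hypothetical; only `*` is known from this thread to cause trouble, so it is the only default.

```python
def strip_unspoken_symbols(text: str, symbols: str = "*") -> str:
    """Remove characters that are not meant to be spoken.

    Only '*' is removed by default. Extend `symbols` if you find other
    characters that trigger bad audio; that list is an assumption to tune.
    """
    table = str.maketrans("", "", symbols)
    # Collapse whitespace runs (including newlines) left behind by removals.
    return " ".join(text.translate(table).split())


print(strip_unspoken_symbols("*args"))        # args
print(strip_unspoken_symbols("pass * here"))  # pass here
```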

Jan 10 '25 02:01 20km-shimakaze

@20km-shimakaze Which characters have you found that you need to remove? Or instead, which character sets do you keep?

Jan 10 '25 16:01 mashdragon

I have the same problem when generating Chinese speech. It seems to occur randomly: the same sentence can be generated correctly when you try it again.

Jan 13 '25 02:01 Ginzyl

I haven't found a solution yet, but increasing the audio duration of the material seems to reduce the probability of occurrence.

Jan 13 '25 02:01 Haoran1272

@20km-shimakaze Which characters have you found that you need to remove? Or instead, which character sets do you keep?

I deleted the '*' character. It is not pronounced during TTS anyway, and I found that, even after generating the audio several times, the pronunciation goes wrong once this character has been read. So I deleted it.

Jan 17 '25 14:01 20km-shimakaze

I haven't found a solution yet, but increasing the audio duration of the material seems to reduce the probability of occurrence.

So how long does the audio need to be to reduce these Minion voices?

Mar 16 '25 17:03 AiBoBoMaker

We improved stability in this version. You can give it a try. I'll close the issue in 7 days if there are no more questions.

Jun 05 '25 08:06 Whale-Dolphin

Can you explain how stability was improved? And when you say "this version" are you talking about openaudio-s1-mini?

Jun 05 '25 17:06 mashdragon

Instruction following is not that stable on S1 yet, so you may want to use the 1.6 version on our website. We are working hard to make instruction generation more stable. For now, you can work around the problem by generating the sentence a few more times. Thanks for your support!

Sep 21 '25 05:09 Whale-Dolphin