fish-speech
Partially garbled audio on Hugging Face online demo for a short English input
Self Checks
- [X] This template is only for bug reports. For questions, please visit Discussions.
- [X] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [X] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Cloud
Environment Details
Hugging Face online demo (V1.5 medium)
Steps to Reproduce
Tried to synthesize the text "We are not responsible for any misuse of the model, please consider your local laws and regulations before using it."
All parameters were left at their default values.
✔️ Expected Behavior
Normal speech.
❌ Actual Behavior
The audio starts normally with "We are not", but is then followed by garbled audio that sounds sped up. The total audio duration is only 3 seconds.
https://github.com/user-attachments/assets/45b1d12c-575c-457d-bfc4-45a5f9c6f634
I got this after trying the model with only 2 test inputs, which suggests it is not that rare. If I synthesize the same text several more times, I get other voices, and they don't seem to have this issue (as far as I've tested).
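As a rough sanity check on why 3 seconds is too short for this sentence, the speaking rate it implies can be computed directly. The sketch below is only illustrative: the helper names are hypothetical, and the 4 words-per-second threshold is an assumption (ordinary English speech is typically around 2 to 3 words per second), not a value taken from fish-speech.

```python
def words_per_second(text: str, duration_s: float) -> float:
    """Return the speaking rate implied by a text and its audio duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Whitespace-separated tokens are a crude but adequate word count here.
    return len(text.split()) / duration_s


def looks_truncated(text: str, duration_s: float, max_wps: float = 4.0) -> bool:
    """Flag audio that is too short for its text.

    max_wps is an assumed threshold, chosen above normal speaking rates.
    """
    return words_per_second(text, duration_s) > max_wps


# The input text and the 3-second duration come from this report.
TEXT = ("We are not responsible for any misuse of the model, please consider "
        "your local laws and regulations before using it.")

rate = words_per_second(TEXT, 3.0)
print(f"{rate:.2f} words/s, suspicious: {looks_truncated(TEXT, 3.0)}")
```

The sentence has 20 words, so a 3-second clip implies roughly 6.7 words per second, well above a normal speaking rate. A check like this could flag affected outputs when testing many samples, although it would not detect garbled audio whose duration happens to be normal.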
Related issues
Seems closely related to issue #632, but I decided to open a new issue because:
- My input was in English
- Issue #632 is described as "Swallowing words, reading normally at first, then speeding up, and then not reading the last word of the sentence completely", which is not exactly what happens here. Here the audio starts normally but continues with completely garbled audio
- A comment on that issue says the issue is resolved
- This was produced by the online demo with the latest model (1.5 medium)