fish-speech
(fish-speech v1.5) bigger real time factor on short texts
Self Checks
- [X] This template is only for bug reports. For questions, please visit Discussions.
- [X] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Docker)
Environment Details
Tesla T4
Steps to Reproduce
- The server is started in Docker with:
"python", "-m", "tools.api_server", \
"--listen", "0.0.0.0:8080", \
"--llama-checkpoint-path", "checkpoints/fish-speech-1.5", \
"--decoder-checkpoint-path", "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth", \
"--decoder-config-name", "firefly_gan_vq", \
"--compile", \
"--half" \
- Upload reference audios.
- The client makes a request specifying reference_id.
✔️ Expected Behavior
I expect a TTS latency similar to fish-speech v1.4, at around 500 ms, for a non-referenced audio generation from a short text of only a few characters.
❌ Actual Behavior
The real-time factor for short text chunks is higher than for longer texts.
{"level":"info","timestamp":"2024-12-12T17:59:15.231Z","caller":"mando/engine.go:366","msg":"TTS performance","pid":1,"audio_duration_ms":1646,"latency_ms":1913,"text":"好的,"}
{"level":"info","timestamp":"2024-12-12T17:59:20.715Z","caller":"mando/engine.go:366","msg":"TTS performance","pid":1,"audio_duration_ms":6009,"latency_ms":2822,"text":"让我们开始另一个故事!\n\n在一个神秘的王国里,住着一位勇敢的小骑士,"}
{"level":"info","timestamp":"2024-12-12T17:59:25.770Z","caller":"mando/engine.go:366","msg":"TTS performance","pid":1,"audio_duration_ms":13428,"latency_ms":4964,"text":"名叫亚瑟。亚瑟非常渴望成为一名伟大的骑士,保护他的村庄和朋友们。有一天,村庄里传来了一个坏消息:一条凶猛的龙来到了附近的山上,"}
In my application log above, audio_duration_ms is the length of the generated audio and latency_ms is the TTS duration.
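The shortest text here had a real-time factor > 1 (1913 ms of latency for 1646 ms of audio), meaning generation was slower than real time, while the two longer texts stayed well below 1.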
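For reference, here is a minimal sketch of how the real-time factor can be derived from the log lines above. It assumes the real-time factor is defined as latency_ms divided by audio_duration_ms (so values above 1 are slower than real time). The function name real_time_factor and the trimmed-down sample records are illustrative only and are not part of fish-speech or of my application.

```python
import json


def real_time_factor(log_line: str) -> float:
    """Return latency / audio duration for one JSON log line.

    Assumption: RTF = latency_ms / audio_duration_ms, so a value
    above 1.0 means synthesis took longer than the audio it produced.
    """
    record = json.loads(log_line)
    return record["latency_ms"] / record["audio_duration_ms"]


# Sample records reduced to the two fields used here; the numbers are
# copied from the three log lines in this report.
samples = [
    '{"audio_duration_ms": 1646, "latency_ms": 1913}',
    '{"audio_duration_ms": 6009, "latency_ms": 2822}',
    '{"audio_duration_ms": 13428, "latency_ms": 4964}',
]

for line in samples:
    print(round(real_time_factor(line), 2))
```

With these three records the sketch prints 1.16, 0.47 and 0.37, which shows the shortest text as the only one generated slower than real time.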