Fish TTS API Fails to Match Reference Audio Tone and Style
Self Checks
- [x] This template is only for bug reports. For questions, please visit Discussions.
- [x] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source), Self Hosted (Docker)
Environment Details
- Operating System: Windows 11 (fully updated)
- Processor: Intel Core i5 13th Gen
- GPU: NVIDIA RTX 4050
- Python Version: 3.12
- Relevant libraries and versions:
  - torch: 2.4.1
  - Gradio: 4.44.0
  - pydub: latest version installed via pip
  - ffmpeg: installed and accessible via the system PATH (version 2024-08-01-git)
Steps to Reproduce
1. Install Fish TTS and dependencies as per the documentation.
2. Run the following code to use the Fish TTS API (I have also included the code file at the end of this report):

```python
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860/")
result = client.predict(
    text="This is a test input.",
    normalize=True,
    reference_id="test_reference",
    reference_audio=handle_file(r"C:\Users\ashu4\Music\Sound\final_new_vocal.wav"),
    reference_text="",
    max_new_tokens=0,
    chunk_length=200,
    top_p=0.7,
    repetition_penalty=1.2,
    temperature=0.7,
    seed=0,
    use_memory_cache="on",
    api_name="/partial",
)
print(result)
```

3. Observe the results: the generated audio chunks do not match the tone, speed, or style of the provided reference audio. In some cases, the first chunk is synthesized as a female voice and the second as a male voice.
4. Stitch the chunks using the following code:

```python
from pydub import AudioSegment

final_audio = AudioSegment.empty()
for chunk_path in ["chunk1.wav", "chunk2.wav"]:  # Replace with actual chunk paths
    final_audio += AudioSegment.from_file(chunk_path)
final_audio.export("final_output.wav", format="wav")
```

The final output is inconsistent and does not replicate the reference audio style.
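To make the chunk-to-chunk inconsistency easier to report, one option is to compare basic properties of each chunk before stitching. This is only a sketch using pydub, with the same placeholder chunk paths as above; it will not detect a voice or gender change, but it does show whether the chunks are at least technically uniform:

```python
from pydub import AudioSegment

# Print basic properties of each generated chunk before stitching.
# Chunk paths are placeholders, as in the stitching snippet above.
for chunk_path in ["chunk1.wav", "chunk2.wav"]:
    seg = AudioSegment.from_file(chunk_path)
    print(
        f"{chunk_path}: {seg.duration_seconds:.2f} s, "
        f"{seg.frame_rate} Hz, {seg.channels} ch, {seg.dBFS:.1f} dBFS"
    )
```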
✔️ Expected Behavior
The generated audio should replicate the tone, speed, and style of the reference audio provided in the reference_audio parameter. All audio chunks should be consistent in voice, tone, and style.
❌ Actual Behavior
The generated audio:
- Does not match the tone, speed, or style of the provided reference audio.
- Is inconsistent between chunks (e.g., one chunk is in a male voice, another in a female voice).

When running the same input in the Gradio UI, the results are far better and match the reference audio, indicating that the API may not be fully utilizing GPU resources or properly processing the reference audio.
Try setting reference_id=""; that seemed to work for me.
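For anyone trying this, here is a sketch of the suggested workaround: the same client.predict call from the report above, with only reference_id changed to an empty string (the endpoint URL and file path are the reporter's values, not something the workaround requires):

```python
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860/")
result = client.predict(
    text="This is a test input.",
    normalize=True,
    reference_id="",  # suggested workaround: leave the server-side reference id empty
    reference_audio=handle_file(r"C:\Users\ashu4\Music\Sound\final_new_vocal.wav"),
    reference_text="",
    max_new_tokens=0,
    chunk_length=200,
    top_p=0.7,
    repetition_penalty=1.2,
    temperature=0.7,
    seed=0,
    use_memory_cache="on",
    api_name="/partial",
)
print(result)
```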
I encountered the same error. I used the model "7f92f8afb8ec43bf81429cc1c9199cb1", but the voice tone returned is different every time, and sometimes it is even a male voice.
Same here. I use version 1.5, and the first half of the output is a male voice, which is not correct, while the second half is female, which is right.
Same here.
v1.5, same here. Has this problem been solved?
Is there a way to fix the output to a single tone?
reference_id="" works for me.
reference_id = '' is not the solution.
You guys are just passing no speaker to the input.
https://github.com/Picus303/fish-speech/blob/0529fc39171ffefff00913870bb031ccb948a2b5/fish_speech/inference_engine/init.py#L52-L63
```python
ref_id: str | None = req.reference_id
prompt_tokens, prompt_texts = [], []

# Load the reference audio and text based on id, hash, or preprocessed references
if ref_id is not None:
    prompt_tokens, prompt_texts = self.load_by_id(ref_id, req.use_memory_cache)

elif req.references:
    prompt_tokens, prompt_texts = self.load_by_hash(
        req.references, req.use_memory_cache
    )
```
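Based only on the snippet above, here is a standalone sketch (a hypothetical helper, not the actual fish-speech code) of that dispatch, to show why the empty-string workaround is ambiguous: an empty string is not None in Python, so whether reference_id="" reaches this code as "" or as None depends on how the Gradio/API layer normalizes the request.

```python
# Standalone sketch of the dispatch logic quoted above (hypothetical helper,
# not part of fish-speech) to illustrate which branch a request falls into.
def resolve_reference(reference_id, references):
    if reference_id is not None:          # "" is not None, so an empty string
        return f"load_by_id({reference_id!r})"  # would still take the id path here
    elif references:
        return f"load_by_hash({len(references)} reference(s))"
    return "no speaker reference at all"

print(resolve_reference("test_reference", []))          # id path: the id must exist server-side
print(resolve_reference("", []))                        # empty string still selects the id path
print(resolve_reference(None, ["audio + transcript"]))  # only None falls through to uploaded references
```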