
Fish TTS API Fails to Match Reference Audio Tone and Style

Open AshutoshMipax opened this issue 11 months ago • 8 comments

Self Checks

  • [x] This template is only for bug reports. For questions, please visit Discussions.
  • [x] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [x] Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source), Self Hosted (Docker)

Environment Details

  • Operating System: Windows 11 (fully updated)
  • Processor: Intel Core i5 13th Gen
  • GPU: NVIDIA RTX 4050
  • Python Version: Python 3.12
  • Relevant Libraries and Versions:
      • torch: 2.4.1
      • Gradio: 4.44.0
      • pydub: Latest version installed via pip
      • ffmpeg: Installed and accessible via system PATH (version: 2024-08-01-git)

Steps to Reproduce

1. Install Fish TTS and its dependencies as described in the documentation.

2. Run the following code to call the Fish TTS API (the full script is attached at the end of this report):

        from gradio_client import Client, handle_file

        # Connect to the locally running Gradio server
        client = Client("http://127.0.0.1:7860/")
        result = client.predict(
            text="This is a test input.",
            normalize=True,
            reference_id="test_reference",
            reference_audio=handle_file(r"C:\Users\ashu4\Music\Sound\final_new_vocal.wav"),
            reference_text="",
            max_new_tokens=0,
            chunk_length=200,
            top_p=0.7,
            repetition_penalty=1.2,
            temperature=0.7,
            seed=0,
            use_memory_cache="on",
            api_name="/partial",
        )
        print(result)

3. Observe the results: the generated audio chunks do not match the tone, speed, or style of the provided reference audio. In some cases, the first chunk is synthesized as a female voice and the second as a male voice.

4. Stitch the chunks using the following code:

        from pydub import AudioSegment

        # Concatenate the generated chunks in order
        final_audio = AudioSegment.empty()
        for chunk_path in ["chunk1.wav", "chunk2.wav"]:  # Replace with actual chunk paths
            final_audio += AudioSegment.from_file(chunk_path)
        final_audio.export("final_output.wav", format="wav")

5. The final output is inconsistent and does not replicate the reference audio style.

fish.py.txt

✔️ Expected Behavior

The generated audio should replicate the tone, speed, and style of the reference audio provided in the reference_audio parameter. All audio chunks should be consistent in voice, tone, and style.

❌ Actual Behavior

The generated audio:

  • Does not match the tone, speed, or style of the provided reference audio.
  • Is inconsistent between chunks (e.g., one chunk is in a male voice, another in a female voice).

When running the same input in the Gradio UI, the results are far better and match the reference audio, indicating that the API may not be fully utilizing GPU resources or properly processing the reference audio.

AshutoshMipax • Jan 17 '25 12:01

Try setting reference_id="". That seemed to work for me.

ANTON728 • Feb 23 '25 21:02

I encountered the same error. I used the model "7f92f8afb8ec43bf81429cc1c9199cb1", but the voice tone returned is different every time, and there's even a male voice.

dsdbelynn • Mar 04 '25 04:03

Same here. I use version 1.5, and the first half of the audio is in a male voice, which is not correct, while the second half is in a female voice, which is right.

chaopengio • Mar 18 '25 08:03

same here.

Prsaro • Mar 28 '25 03:03

v1.5, same here. Has this problem been solved?

lymanzhao • Apr 25 '25 10:04

Is there a way to keep one fixed tone?

smile-yushu • Apr 28 '25 06:04

Setting reference_id="" works.

lymanzhao • Apr 28 '25 08:04

reference_id = '' is not the solution.

You guys are just passing no speaker to the input.

https://github.com/Picus303/fish-speech/blob/0529fc39171ffefff00913870bb031ccb948a2b5/fish_speech/inference_engine/__init__.py#L52-L63

        ref_id: str | None = req.reference_id
        prompt_tokens, prompt_texts = [], []
        # Load the reference audio and text based on id, hash, or preprocessed references
        if ref_id is not None:
            prompt_tokens, prompt_texts = self.load_by_id(ref_id, req.use_memory_cache)

        elif req.references:
            prompt_tokens, prompt_texts = self.load_by_hash(
                req.references, req.use_memory_cache
            )

shigabeev • May 18 '25 03:05
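
The branch order in the excerpt quoted in the last comment can be modeled with a small standalone sketch. It reproduces only the two conditions visible in that excerpt and does not call any fish-speech code. `select_reference_source` and its return labels are hypothetical names introduced here for illustration. It also assumes the values reach this code unchanged, since the excerpt does not show whether the Gradio or API layer converts an empty string to None first.

```python
from typing import Optional, Sequence


def select_reference_source(
    reference_id: Optional[str],
    references: Sequence[object],
) -> str:
    """Return a label for the branch the quoted excerpt would take.

    Hypothetical helper: it mirrors only the `if ... is not None` / `elif`
    structure of the excerpt and never loads any audio.
    """
    if reference_id is not None:
        # Any non-None id, including the empty string, selects the id branch.
        return "load_by_id"
    elif references:
        # Uploaded references are consulted only when the id is None.
        return "load_by_hash"
    # Neither branch runs, so the prompt lists would stay empty.
    return "no_reference"


if __name__ == "__main__":
    uploaded = ["<reference audio>"]  # placeholder for an uploaded reference
    for ref_id in ("test_reference", "", None):
        print(repr(ref_id), "->", select_reference_source(ref_id, uploaded))
```

Under these assumptions, both "test_reference" and "" take the id branch even though a reference audio is supplied, and the uploaded audio is reached only when reference_id is None. This matches the point that an empty reference_id does not by itself make the uploaded audio the speaker.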