low speaker similarity in zero-shot tts

Open LoganLiu66 opened this issue 9 months ago • 2 comments

Thank you for this great work. When I try zero-shot TTS, the speaker similarity between spk_smp and the generated audio is low. My prompt audio, prompt_text, and generated audio are in audios.zip. What might be causing this, and do you have any advice for improving it? Thanks.

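    # Setup assumed here; it was not shown in the original post.
    import torch
    import torchaudio
    import ChatTTS
    from tools.audio import load_audio  # helper from the ChatTTS repository

    chat = ChatTTS.Chat()
    chat.load()  # load the pretrained models

    audio_file = 'sample.wav'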
    prompt_text = 'I chance to leave him alone, but[uv_break] no[uv_break]. She just wanted to see him again[uv_break]. Anna[uv_break], you don't know how it feels to lose a sister[uv_break].'
    spk_smp = chat.sample_audio_speaker(load_audio(audio_file, 24000))

    params_infer_code = ChatTTS.Chat.InferCodeParams(
        spk_smp=spk_smp,
        txt_smp=prompt_text,
        temperature=0.3,
        top_P=0.7,
        top_K=20
    )
    params_refine_text = ChatTTS.Chat.RefineTextParams(
        prompt='[oral_5]'
    )

    text = "I do love books, but I think I like writing about them more than selling them."
    wav = chat.infer(
        text,
        params_infer_code=params_infer_code,
        split_text=False,
        params_refine_text=params_refine_text
    )
    torchaudio.save("sample_generated.wav", torch.from_numpy(wav[0]).unsqueeze(0), 24000)
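One way to put a number on "low similarity" is to compare speaker embeddings of the prompt audio and the generated audio with cosine similarity. The sketch below only shows the comparison step. It assumes the two embedding vectors come from some external speaker-verification model, which is not part of ChatTTS and is not shown here; `cosine_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder vectors; real embeddings would come from an external
    # speaker-verification model (an assumption, not a ChatTTS API).
    prompt_embedding = [0.2, 0.7, 0.1]
    generated_embedding = [0.25, 0.6, 0.2]
    print(cosine_similarity(prompt_embedding, generated_embedding))
```

Values closer to 1 mean the two embeddings point in a more similar direction. What counts as "the same speaker" depends on the embedding model used.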

LoganLiu66, Mar 03 '25 07:03

Zero-shot works best with audio generated by ChatTTS itself. If you want to use external audio, make sure the audio is of good quality and that the transcript, txt_smp, matches the audio exactly, including marks such as [lbreak].
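As a rough illustration of the "good quality" advice, here is a minimal sketch of a sanity check you could run on external prompt audio before passing it to sample_audio_speaker. It is not part of ChatTTS. The function name `check_prompt_audio` and every threshold are assumptions chosen for illustration; only the 24000 Hz rate comes from the code in this thread. Samples are assumed to be mono floats in the range [-1, 1].

```python
import math
from typing import List, Sequence


def check_prompt_audio(
    samples: Sequence[float],
    sample_rate: int,
    expected_rate: int = 24000,    # rate used with load_audio in this thread
    min_seconds: float = 2.0,      # assumed lower bound, not a ChatTTS rule
    max_seconds: float = 15.0,     # assumed upper bound, not a ChatTTS rule
    clip_level: float = 0.999,     # |sample| at or above this counts as clipped
    max_clip_ratio: float = 0.001, # assumed tolerance for clipped samples
    min_rms: float = 0.01,         # assumed loudness floor
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    if sample_rate != expected_rate:
        problems.append(f"sample rate is {sample_rate} Hz, expected {expected_rate} Hz")
    n = len(samples)
    if n == 0:
        problems.append("audio is empty")
        return problems
    duration = n / sample_rate
    if duration < min_seconds:
        problems.append(f"audio is too short ({duration:.1f} s)")
    elif duration > max_seconds:
        problems.append(f"audio is too long ({duration:.1f} s)")
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    if clipped / n > max_clip_ratio:
        problems.append("audio shows clipping")
    rms = math.sqrt(sum(s * s for s in samples) / n)
    if rms < min_rms:
        problems.append("audio is very quiet")
    return problems


if __name__ == "__main__":
    # Three seconds of a 220 Hz tone stands in for a real recording.
    tone = [0.5 * math.sin(2 * math.pi * 220 * i / 24000) for i in range(24000 * 3)]
    print(check_prompt_audio(tone, 24000))
```

A check like this only catches gross problems such as clipping or silence. It says nothing about background noise, reverberation, or whether txt_smp matches the speech, which still need to be verified by listening.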

fumiama, Mar 12 '25 13:03

    def on_upload_sample(sample_audio_input: Optional[str]) -> str:
        sample_audio = torch.tensor(load_audio(sample_audio_input, 24000)).to('cpu')
        spk_smp = chat.sample_audio_speaker(sample_audio)
        del sample_audio
        return spk_smp

    spk_smb = on_upload_sample(r"input.wav")

    params_infer_code = ChatTTS.Chat.InferCodeParams(
        spk_smp=spk_smb,
        txt_smp="从 博 弈 论 的 定 义 中 我 们 知 道 [uv_break] , 双 方 [uv_break] 或 者 多 方 [uv_break] 进 行 博 弈 的 最 终 目 的 [uv_break] , 都 是 为 自 己 争 取 [uv_break] 最 大 利 益 [uv_break] 。",
    )

    wav = chat.infer(
        text,
        params_infer_code=params_infer_code,
    )

This fails to clone the voice in the input audio.

input.wav

jianglin-code, Nov 16 '25 08:11