CosyVoice: zero-shot voice cloning with the Fun-CosyVoice3-0.5B-2512 model frequently produces gibberish

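Describe the bug: When cloning a voice with the FunAudioLLM/Fun-CosyVoice3-0.5B-2512 model, using the latest code and the cosyvoice.inference_zero_shot method, the output is gibberish most of the time and only occasionally correct. Is this caused by incorrect usage of the code, or by hallucination or generalization problems in the model?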

To Reproduce: the calling code is:

results = cosyvoice.inference_zero_shot(
    text,
    prompt_text,
    prompt_wav_path,
    stream=False,
)
for j in results:
    wav = j["tts_speech"]
    # torchaudio.save expects a 2-D tensor; normalize to shape [1, T]
    if wav.dim() == 1:
        wav = wav.unsqueeze(0)
    elif wav.dim() == 2 and wav.size(0) != 1:
        # keep only the first channel
        wav = wav[:1, :]
    torchaudio.save(
        seg_path,
        wav,
        cosyvoice.sample_rate,
        bits_per_sample=16,
    )
    segment_paths.append(seg_path)

Environment (conda): python=3.10

CUDA: nvcc: NVIDIA (R) Cuda compiler driver Cuda compilation tools, release 12.0, V12.0.140

Input parameters:

prompt 看着窗外缓缓下落的雨滴
text 小蝌蚪找妈妈,找啊找,找到一个好朋友。有朋自远方来,非奸即盗。

Expected behavior: The output should be speech audio that matches the input text.

Additional context

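The inference output is currently unrelated, garbled speech most of the time; short sentences are sometimes read correctly.
I would like to determine whether the code is being used incorrectly, or whether this is an inference stability or hallucination problem in the model itself.

Thanks!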

Dec 25 '25 09:12 TBXark

Attachment: 输入和输出的音频.zip (input and output audio)

Dec 25 '25 09:12 TBXark

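I'm seeing something similar. In addition, even when the output is otherwise correct, the audio sometimes begins with part of the reference audio's content. That portion has the same words as the reference audio but a somewhat different pitch and timbre (the text is probably the same; the reference text may have been added to the text being synthesized).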

Dec 26 '25 04:12 tao1261060556

I tried your input at https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B and the output was normal across several different seeds.

Dec 26 '25 15:12 laishere

if mode_checkbox_group == '3s极速复刻':
    logging.info('get zero_shot inference request')
    set_all_random_seed(seed)
    speech_list = []
    for i in cosyvoice.inference_zero_shot(tts_text, 'You are a helpful assistant.<|endofprompt|>' + prompt_text, postprocess(prompt_wav), stream=stream, speed=speed):
        speech_list.append(i['tts_speech'])
    return (target_sr, torch.concat(speech_list, dim=1).numpy().flatten())

This is the script that runs there. My guess is that you did not add You are a helpful assistant.<|endofprompt|> in front of the prompt text.
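As a minimal sketch of that suggestion, the prefix used by the demo script can be prepended to the prompt text before calling inference_zero_shot. The helper name with_system_prefix and the constant SYSTEM_PREFIX are hypothetical and not part of CosyVoice; the prefix string itself is copied from the demo script. Whether this resolves the gibberish output is not confirmed in this thread.

```python
# Prefix string copied from the demo script; the names below are
# hypothetical illustrations, not part of the CosyVoice API.
SYSTEM_PREFIX = "You are a helpful assistant.<|endofprompt|>"


def with_system_prefix(prompt_text: str) -> str:
    """Prepend the demo's prefix unless the prompt text already starts with it."""
    if prompt_text.startswith(SYSTEM_PREFIX):
        return prompt_text
    return SYSTEM_PREFIX + prompt_text


# Prompt text from the issue's input parameters.
prompt_text = with_system_prefix("看着窗外缓缓下落的雨滴")

# The prefixed prompt would then be passed as in the original report
# (left commented out because it needs a loaded model):
# results = cosyvoice.inference_zero_shot(text, prompt_text, prompt_wav_path, stream=False)
```

The startswith check keeps the helper idempotent, so a prompt that already carries the prefix is not prefixed twice.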

Dec 26 '25 15:12 laishere

Indeed, I did not add You are a helpful assistant.<|endofprompt|>. I didn't know that was required. I'll add it and try again later.

Dec 27 '25 12:12 TBXark