CosyVoice produces garbled speech with transformers==4.53.1

With transformers==4.53.1 the generated speech is garbled; switching to 4.51.3 makes it work normally.

Here is the calling code:

    def speech(
        self,
        input: str,
        voice: Optional[str] = "Chinese Female",
        speed: float = 1,
        reponse_format: str = "mp3",
        **kwargs,
    ) -> str:
        if voice not in self._voices:
            raise ValueError(f"Voice {voice} not supported")

        original_voice = self._get_original_voice(voice)
        model_output = self._model.inference_sft(  # calls the CosyVoice method here
            input, original_voice, stream=False, speed=speed
        )
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_file:
            wav_file_path = temp_file.name
            with wave.open(wav_file_path, "wb") as wf:
                wf.setnchannels(1)  # single track
                wf.setsampwidth(2)  # 16-bit
                wf.setframerate(22050)  # Sample rate
                for i in model_output:
                    tts_audio = (
                        (i["tts_speech"].numpy() * (2**15)).astype(np.int16).tobytes()
                    )
                    wf.writeframes(tts_audio)

                output_file_path = convert(wav_file_path, reponse_format, speed)
                return output_file_path

Environment: Ubuntu 22.04, NVIDIA GeForce RTX 4090, CosyVoice commit 6b21f8e

Jul 11 '25 02:07 yxf0314

Are you using CosyVoice2? Same here.

Jul 11 '25 08:07 ScottishFold007

> Are you using CosyVoice2? Same here.

Yes, CosyVoice2-0.5B.

Jul 11 '25 08:07 yxf0314

I just ran into this too and have no idea what causes it. The output is pure gibberish, and I'm using the official example.

Jul 11 '25 08:07 ScottishFold007

The model was downloaded from the iic/CosyVoice2-0.5B repository.

Jul 11 '25 08:07 ScottishFold007

生成音频.zip: I've uploaded the attachment. Have a listen and see whether it's the same thing.

Jul 11 '25 08:07 ScottishFold007

> 生成音频.zip: I've uploaded the attachment. Have a listen and see whether it's the same thing.

Yes, exactly; it sounds like drunken rambling.

Jul 11 '25 08:07 yxf0314

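> > 生成音频.zip: I've uploaded the attachment. Have a listen and see whether it's the same thing.
>
> Yes, exactly; it sounds like drunken rambling.

Switching to transformers==4.40.1 fixed it immediately.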

Jul 11 '25 09:07 ScottishFold007

This looks like a bug; let's wait for an official fix.

Jul 11 '25 09:07 ScottishFold007

Indeed. I upgraded in order to run vLLM and ended up with exactly this.

Jul 11 '25 09:07 qiao131

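I couldn't figure out the cause for a long time, and checking the dependencies one by one didn't solve it. Finally got it working.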

Jul 12 '25 09:07 BobMind758

Neither 4.51.3 nor 4.40.1 works for me; the speech is still garbled.

Jul 23 '25 03:07 flashzq

Even after installing from requirements.txt, I still have the same problem:

    conformer==0.3.2
    diffusers==0.27.2
    gdown==5.1.0
    gradio==4.32.2
    grpcio==1.57.0
    grpcio-tools==1.57.0
    huggingface-hub==0.23.5
    hydra-core==1.3.2
    HyperPyYAML==1.2.2
    inflect==7.3.1
    librosa==0.10.2
    lightning==2.2.4
    matplotlib==3.7.5
    modelscope==1.15.0
    networkx==3.1
    omegaconf==2.3.0
    onnx==1.16.0
    onnxruntime==1.18.0
    openai-whisper==20231117
    protobuf==4.25
    pydantic==2.7.0
    rich==13.7.1
    soundfile==0.12.1
    tensorboard==2.14.0
    tensorrt-cu12==10.0.1
    tensorrt-cu12-bindings==10.0.1
    tensorrt-cu12-libs==10.0.1
    torch==2.3.1
    torchaudio==2.3.1
    transformers==4.40.1
    uvicorn==0.30.0
    wget==3.2
    fastapi==0.111.0
    fastapi-cli==0.0.4
    WeTextProcessing==1.0.3
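Since installing from requirements.txt does not guarantee that the running interpreter actually sees those versions (another package may have upgraded transformers afterwards, or a different environment may be active), it can help to compare the pins against what is installed. The following is only a sketch: `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical helper names, not part of CosyVoice, and the comparison is a plain string match after dropping any local suffix such as `+cu121`.

```python
from importlib import metadata


def parse_pins(text):
    """Parse whitespace-separated 'name==version' pins into a dict."""
    pins = {}
    for token in text.split():
        name, sep, version = token.partition("==")
        if sep:  # skip anything that is not an exact pin
            pins[name.lower()] = version
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every pin that does not match."""
    mismatches = {}
    for name, pinned in pins.items():
        actual = lookup(name)
        # Ignore a local version label, e.g. '2.3.1+cu121' counts as '2.3.1'.
        base = actual.split("+")[0] if actual else None
        if base != pinned:
            mismatches[name] = (pinned, actual)
    return mismatches


if __name__ == "__main__":
    pins = parse_pins("transformers==4.40.1 torch==2.3.1 torchaudio==2.3.1")
    for name, (pinned, actual) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {actual}")
```

Because the match is textual, a pin like `protobuf==4.25` would be flagged against an installed `4.25.0`; treat such hits as something to look at rather than as proof of a problem.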

Jul 23 '25 03:07 flashzq

I also get this problem with transformers==4.53.2. This node has a lot of issues qwq. 4.51.3 works.

Jul 24 '25 08:07 Kydon-ai

This issue is stale because it has been open for 30 days with no activity.

Aug 27 '25 02:08 github-actions[bot]

https://github.com/FunAudioLLM/CosyVoice/issues/1546#issuecomment-3232416350

Aug 28 '25 08:08 double12gzh

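I found that the cutoff is v4.53.0: v4.53.0 does not work, but rolling back to transformers==4.52.4 works normally. I don't know what changed in this transformers release to cause it.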
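The cutoff described above could be turned into a startup check along these lines. This is a sketch under assumptions: the thread only reports 4.53.0, 4.53.1, and 4.53.2 as garbled and 4.52.4, 4.51.3, and 4.40.1 as working for some users, so treating every version from 4.53.0 onward as affected is an extrapolation, and `FIRST_BAD`, `version_tuple`, and `is_reported_bad` are hypothetical names. Other commenters see garbled output on older versions too, so passing this check does not rule out other causes.

```python
import re
from importlib import metadata

# Assumption drawn from the reports in this thread: 4.52.4 works, 4.53.0 does not.
FIRST_BAD = (4, 53, 0)


def version_tuple(version):
    """Convert '4.53.1' or '4.53.0.dev0' into a comparable 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # leading digits only, so '0rc1' -> 0
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_reported_bad(version):
    """True if the version is at or past the cutoff reported in the thread."""
    return version_tuple(version) >= FIRST_BAD


if __name__ == "__main__":
    try:
        current = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        if is_reported_bad(current):
            print(f"transformers {current} is in the range reported as garbled")
        else:
            print(f"transformers {current} is below the reported cutoff")
```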

Sep 08 '25 12:09 Rhythmblue

I tried various versions of transformers and all of them produce garbled speech. I'm not using vLLM. Is there any way to fix this?

Sep 09 '25 12:09 shanhaidexiamo

> Even after installing from requirements.txt, I still have the same problem: conformer==0.3.2 diffusers==0.27.2 gdown==5.1.0 gradio==4.32.2 grpcio==1.57.0 grpcio-tools==1.57.0 huggingface-hub==0.23.5 hydra-core==1.3.2 HyperPyYAML==1.2.2 inflect==7.3.1 librosa==0.10.2 lightning==2.2.4 matplotlib==3.7.5 modelscope==1.15.0 networkx==3.1 omegaconf==2.3.0 onnx==1.16.0 onnxruntime==1.18.0 openai-whisper==20231117 protobuf==4.25 pydantic==2.7.0 rich==13.7.1 soundfile==0.12.1 tensorboard==2.14.0 tensorrt-cu12==10.0.1 tensorrt-cu12-bindings==10.0.1 tensorrt-cu12-libs==10.0.1 torch==2.3.1 torchaudio==2.3.1 transformers==4.40.1 uvicorn==0.30.0 wget==3.2 fastapi==0.111.0 fastapi-cli==0.0.4 WeTextProcessing==1.0.3

How did you solve it?

Sep 09 '25 12:09 shanhaidexiamo

How did you find this out? Amazing.

Sep 26 '25 09:09 ajkpix

> I also get this problem with transformers==4.53.2. This node has a lot of issues qwq. 4.51.3 works.

Mine doesn't work with transformers 4.51.3 either; the audio is still garbled.

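I'm running the CosyVoice2 model CosyVoice2-0.5B. Startup and synthesis complete without errors, but the pronunciation is garbled.

    CosyVoice2(args.model_dir, load_jit=True, load_trt=True, load_vllm=True, fp16=True)

I installed the officially specified versions, and the synthesized speech is still garbled.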
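With all four acceleration options enabled at once, it is hard to tell whether the garbling comes from transformers or from one of those code paths. One possible way to narrow it down, sketched below, is to start with every flag off and then switch on one flag per run. The flag names are taken from the constructor call above; `bisect_configs` is a hypothetical helper, and the commented usage assumes the `cosyvoice.cli.cosyvoice` import path from the CosyVoice examples.

```python
# Keyword flags copied from the CosyVoice2(...) call quoted in this comment.
ACCEL_FLAGS = ("load_jit", "load_trt", "load_vllm", "fp16")


def bisect_configs(flags=ACCEL_FLAGS):
    """Yield kwargs dicts: all flags off first, then exactly one flag on."""
    baseline = {flag: False for flag in flags}
    yield dict(baseline)
    for flag in flags:
        yield {**baseline, flag: True}


if __name__ == "__main__":
    for kwargs in bisect_configs():
        print(kwargs)
        # Hypothetical usage, assuming CosyVoice is importable and the model
        # directory exists locally:
        #   from cosyvoice.cli.cosyvoice import CosyVoice2
        #   model = CosyVoice2(model_dir, **kwargs)
        #   ...synthesize a short sentence and listen to the result...
```

If the all-off baseline is already garbled, the acceleration flags are unlikely to be the cause and the dependency versions are the next thing to check.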

Are you seeing garbled audio on your side as well? My issue is in this thread: https://github.com/FunAudioLLM/CosyVoice/issues/1601

Oct 12 '25 11:10 worm128