CosyVoice vllm for cosyvoice3

When will a vLLM version of CosyVoice3 be released?

Dec 16 '25 07:12 tujie-jiangye

vLLM is supported.

Dec 16 '25 08:12 11075225

vLLM doesn't work for me under WSL (RTX 3070 8 GB, Windows 11).

INFO 12-16 18:05:19 [worker.py:291] the current vLLM instance can use total_gpu_memory (8.00GiB) x gpu_memory_utilization (0.40) = 3.20GiB
INFO 12-16 18:05:19 [worker.py:291] model weights take 0.70GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.12GiB; the rest of the memory reserved for KV Cache is 1.34GiB.
INFO 12-16 18:05:19 [executor_base.py:112] # cuda blocks: 7343, # CPU blocks: 21845
INFO 12-16 18:05:19 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 3.59x
INFO 12-16 18:05:20 [model_runner.py:1512] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 70/70 [00:45<00:00, 1.53it/s]
INFO 12-16 18:06:06 [model_runner.py:1670] Graph capturing finished in 42 secs, took 0.31 GiB
INFO 12-16 18:06:06 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 49.67 seconds
2025-12-16 18:06:11,095 DEBUG Starting new HTTPS connection (1): stats.vllm.ai:443
2025-12-16 18:06:11,237 INFO Converting onnx to trt...
[12/16/2025-18:06:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 10680, GPU 7514 (MiB)
[12/16/2025-18:06:28] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU -837, GPU +4, now: CPU 9642, GPU 7518 (MiB)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zhu/CosyVoice/vllm_example.py", line 39, in <module>
[rank0]:     main()
[rank0]:   File "/home/zhu/CosyVoice/vllm_example.py", line 35, in main
[rank0]:     cosyvoice3_example()
[rank0]:   File "/home/zhu/CosyVoice/vllm_example.py", line 25, in cosyvoice3_example
[rank0]:     cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B', load_trt=True, load_vllm=True, fp16=False)
[rank0]:   File "/home/zhu/CosyVoice/cosyvoice/cli/cosyvoice.py", line 236, in AutoModel
[rank0]:     return CosyVoice3(**kwargs)
[rank0]:   File "/home/zhu/CosyVoice/cosyvoice/cli/cosyvoice.py", line 221, in __init__
[rank0]:     self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
[rank0]:   File "/home/zhu/CosyVoice/cosyvoice/cli/model.py", line 85, in load_trt
[rank0]:     convert_onnx_to_trt(flow_decoder_estimator_model, self.get_trt_kwargs(), flow_decoder_onnx_model, fp16)
[rank0]:   File "/home/zhu/CosyVoice/cosyvoice/utils/file_utils.py", line 67, in convert_onnx_to_trt
[rank0]:     with open(onnx_model, "rb") as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'pretrained_models/Fun-CosyVoice3-0.5B/flow.decoder.estimator.fp32.onnx'

Dec 16 '25 16:12 kanzhuzhu
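
The memory figures in the log above can be checked by hand, which helps when tuning gpu_memory_utilization on an 8 GB card. The sketch below reproduces the arithmetic. All numbers are copied from the log except the block size of 16 tokens per KV-cache block, which is an assumption (the log does not print it), and the function names are made up for illustration.

```python
# Reproduce the memory budget reported in the vLLM log above.
# Every number is copied from the log except BLOCK_SIZE, which is assumed.

def kv_cache_budget_gib(total_gib, utilization, weights_gib, non_torch_gib, activation_gib):
    """Memory left for the KV cache once the other consumers are subtracted."""
    usable = total_gib * utilization
    return usable - weights_gib - non_torch_gib - activation_gib


def max_concurrency(num_gpu_blocks, block_size, tokens_per_request):
    """How many full-length requests fit into the KV cache at once."""
    return num_gpu_blocks * block_size / tokens_per_request


BLOCK_SIZE = 16  # assumption: tokens per KV-cache block, not shown in the log

kv_gib = kv_cache_budget_gib(8.00, 0.40, 0.70, 0.04, 1.12)
concurrency = max_concurrency(7343, BLOCK_SIZE, 32768)

print(f"KV cache budget: {kv_gib:.2f} GiB")    # log reports 1.34GiB
print(f"Max concurrency: {concurrency:.2f}x")  # log reports 3.59x
```

Under these assumptions, the vLLM part of the startup fits into the 3.20 GiB budget; the run fails later, during the ONNX-to-TensorRT conversion.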

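I now have vLLM running under WSL. In vllm_example.py, the call is cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B', load_trt=False, load_vllm=True, fp16=True). The source sets load_trt to True; changing it to False skips the TensorRT loading step, and vLLM then runs successfully. Separately, the webui would not start; upgrading gradio fixed that.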

Dec 17 '25 07:12 kanzhuzhu
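
The workaround above can be applied automatically by enabling TensorRT only when the ONNX export is actually present. This is a minimal sketch, not part of CosyVoice: should_load_trt and build_kwargs are hypothetical helpers, the ONNX file name is the one from the FileNotFoundError in the traceback, and the AutoModel import path is inferred from the traceback (cosyvoice/cli/cosyvoice.py).

```python
import os

# File name taken from the FileNotFoundError in the traceback above.
ONNX_NAME = "flow.decoder.estimator.fp32.onnx"


def should_load_trt(model_dir: str) -> bool:
    """Hypothetical helper: enable TensorRT only if the ONNX export exists."""
    return os.path.isfile(os.path.join(model_dir, ONNX_NAME))


def build_kwargs(model_dir: str, fp16: bool = True) -> dict:
    """Keyword arguments for AutoModel, mirroring the call in vllm_example.py."""
    return {
        "model_dir": model_dir,
        "load_trt": should_load_trt(model_dir),
        "load_vllm": True,
        "fp16": fp16,
    }


if __name__ == "__main__":
    kwargs = build_kwargs("pretrained_models/Fun-CosyVoice3-0.5B")
    print(kwargs)
    # With the CosyVoice repository on sys.path, the model would be created as:
    #   from cosyvoice.cli.cosyvoice import AutoModel
    #   cosyvoice = AutoModel(**kwargs)
```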

Does yours work properly with fp16=True? In my case the generated audio has no sound at all. On my Linux server deployment, setting load_trt=True works fine.

Dec 17 '25 07:12 11075225

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B', load_trt=False, load_vllm=True, fp16=True) works fine for me. The only issue is that some text is split into overly long segments, so parts of the speech are missing. Also, with load_trt=True, don't you get the error that 'pretrained_models/Fun-CosyVoice3-0.5B/flow.decoder.estimator.fp32.onnx' cannot be found? Do you have that file in your directory?

Dec 17 '25 08:12 kanzhuzhu

You can download flow.decoder.estimator.fp32.onnx from ModelScope rather than Hugging Face :)

Dec 17 '25 10:12 WenyuanLi212
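
As a rough illustration of the ModelScope route, the sketch below checks which required files are missing from the local model directory and, if needed, fetches a snapshot. The helper names are hypothetical. snapshot_download comes from the modelscope package (pip install modelscope), and its local_dir argument is assumed to be available in the installed version. The thread does not state the ModelScope repository ID, so it is left as a parameter.

```python
import os

# File named in the thread; extend the list as needed.
REQUIRED = ["flow.decoder.estimator.fp32.onnx"]


def missing_files(model_dir, required):
    """Return the required file names that are not present in model_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


def fetch_from_modelscope(model_id, local_dir):
    """Download a model snapshot from ModelScope into local_dir.

    model_id must be supplied by the caller; it is not given in the thread.
    """
    from modelscope import snapshot_download  # imported lazily on purpose
    return snapshot_download(model_id, local_dir=local_dir)


def ensure_model(model_id, model_dir, required=REQUIRED):
    """Download only when something is missing; return what is still missing."""
    if missing_files(model_dir, required):
        fetch_from_modelscope(model_id, model_dir)
    return missing_files(model_dir, required)
```

Calling ensure_model with the repository ID and 'pretrained_models/Fun-CosyVoice3-0.5B' returns an empty list once the ONNX file is in place.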

I ran into the same problem: with fp16=True the generated audio has no sound at all, and it seems to happen only when load_trt=True and fp16=True are both set. Have you found the cause?

Dec 23 '25 14:12 BCILiang

I solved this problem. The cause was that I had installed an older version of tensorrt :)

Dec 24 '25 02:12 BCILiang

Could you share your tensorrt and vllm versions?

Dec 24 '25 03:12 11075225

vllm==0.9.0; tensorrt-cu12, tensorrt-cu12-bindings, and tensorrt-libs are all 10.13.3.9.

Dec 25 '25 01:12 BCILiang
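
To compare a local environment against the versions reported above, something like the following sketch may help. It uses only the standard library (importlib.metadata); the distribution names are copied as written in the thread and may differ from the names pip actually installed, in which case the lookup simply reports None.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the thread; names are copied as written there.
REPORTED = {
    "vllm": "0.9.0",
    "tensorrt-cu12": "10.13.3.9",
    "tensorrt-cu12-bindings": "10.13.3.9",
    "tensorrt-libs": "10.13.3.9",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare(reported):
    """Map each distribution name to (installed, reported, matches)."""
    result = {}
    for name, want in reported.items():
        have = installed_version(name)
        result[name] = (have, want, have == want)
    return result


if __name__ == "__main__":
    for name, (have, want, ok) in compare(REPORTED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: installed={have} reported={want} [{status}]")
```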