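CosyVoice: After the September update that optimized inference speed, has anything changed in how it is used?

With the new code, the output speed doesn't seem to have improved. Do I need to do anything extra to benefit from it? (screenshot attached)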

Sep 06 '24 03:09 EvilCalf

Also, with streaming enabled, GPU memory usage shoots straight up to 86% of an A800 80G.

Sep 06 '24 04:09 EvilCalf

The gRPC code imports cosyvoice_pb2, which doesn't seem to exist.

Sep 06 '24 04:09 EvilCalf

We tested on a V100, and RTF dropped from 1.2 to 0.8. We use libtorch and ONNX for inference. Check your code for inconsistencies.

Sep 06 '24 05:09 aluminumbox
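For readers comparing their own numbers with the RTF figures in the reply above, here is a minimal sketch of measuring per-chunk real-time factor (compute time divided by the duration of audio produced) around a streaming generator. The helper names `rtf` and `timed_chunks`, the fake generator, and the 22050 Hz sample rate are assumptions for illustration, not part of the CosyVoice API.

```python
import time
from typing import Iterable, Iterator, Sequence, Tuple


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: compute time divided by duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_chunks(
    chunks: Iterable[Sequence[float]], sample_rate: int
) -> Iterator[Tuple[float, float]]:
    """Yield (audio_seconds, rtf) for each chunk pulled from a streaming generator."""
    start = time.perf_counter()
    for chunk in chunks:
        elapsed = time.perf_counter() - start
        audio_seconds = len(chunk) / sample_rate
        yield audio_seconds, rtf(elapsed, audio_seconds)
        start = time.perf_counter()  # restart the clock for the next chunk


def fake_stream() -> Iterator[Sequence[float]]:
    """Hypothetical stand-in for a streaming TTS call; replace with the real one."""
    for n_samples in (22050, 44100):
        time.sleep(0.01)  # pretend to do some work
        yield [0.0] * n_samples


for audio_seconds, value in timed_chunks(fake_stream(), sample_rate=22050):
    print(f"{audio_seconds:.2f}s of audio, rtf {value:.3f}")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the threshold that matters for streaming use.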

So I should still just call cosyvoice.inference_sft with stream=True, right?

Sep 06 '24 05:09 EvilCalf

Yes. As for the memory increase, we removed torch.cuda.empty_cache() from model.py because, when running as a service, it empties the cache for all threads, which increases latency. If your memory usage grows too much, you can keep the call.

Sep 06 '24 06:09 aluminumbox
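One possible middle ground between always calling `torch.cuda.empty_cache()` and never calling it is to release the cache only when reserved memory crosses a threshold. The sketch below is an assumption about how one might do that, not what model.py does; the helper names and the 0.8 threshold are hypothetical. `torch` is imported lazily so the decision helper can be tried without a GPU.

```python
def should_empty_cache(reserved_bytes: int, total_bytes: int, threshold: float = 0.8) -> bool:
    """Return True when reserved GPU memory exceeds `threshold` of the device total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return reserved_bytes / total_bytes > threshold


def maybe_empty_cache(threshold: float = 0.8) -> bool:
    """Release PyTorch's cached GPU blocks only when usage is above `threshold`.

    Returns True if the cache was emptied, False otherwise (including when
    torch or CUDA is unavailable).
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    device = torch.cuda.current_device()
    reserved = torch.cuda.memory_reserved(device)
    total = torch.cuda.get_device_properties(device).total_memory
    if should_empty_cache(reserved, total, threshold):
        torch.cuda.empty_cache()
        return True
    return False


# With the 86% figure reported earlier in the thread, a 0.8 threshold would trigger.
print(should_empty_cache(reserved_bytes=86, total_bytes=100))
```

Calling `maybe_empty_cache()` after each request keeps the cost of emptying the cache off the common path, at the price of occasional slower requests when the threshold is hit.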

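(screenshot attached) I ran the same text 10 times. In theory it should gradually speed up and settle at a stable final value, but in later runs it stalls, and the periods of spiking GPU utilization get longer too.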

Sep 06 '24 06:09 EvilCalf

Yes, we have noticed that the final chunk's RTF is much higher. We are also looking into it.

Sep 06 '24 07:09 aluminumbox

We have tested the code, and there is nothing wrong with it. The final chunk has a different length, so the ONNX run needs more warmup steps.

Sep 06 '24 07:09 aluminumbox

Hmm, so after 10 runs it is still in the warmup phase? But the last chunk doesn't have high latency every time.

Sep 06 '24 08:09 EvilCalf

Yes. You can set use_onnx=False to disable ONNX, or run about 50 inferences, after which the last chunk's time consumption becomes stable.

Sep 06 '24 08:09 aluminumbox
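The "run about 50 inferences" suggestion can be sketched as a small warmup loop that drains each stream fully (so the final chunk is exercised too) and records how long each run takes. `warm_up`, `is_stable`, and the `synthesize` callable are hypothetical names for illustration; `synthesize` stands in for whatever streaming call is being warmed up, and the window and tolerance values are arbitrary assumptions.

```python
import time
from itertools import cycle, islice
from typing import Callable, Iterable, List, Sequence


def warm_up(
    synthesize: Callable[[str], Iterable], texts: Sequence[str], runs: int = 50
) -> List[float]:
    """Call `synthesize` `runs` times, cycling over `texts`; return seconds per run."""
    timings: List[float] = []
    for text in islice(cycle(texts), runs):
        start = time.perf_counter()
        for _chunk in synthesize(text):  # drain the stream so the final chunk runs too
            pass
        timings.append(time.perf_counter() - start)
    return timings


def is_stable(timings: Sequence[float], window: int = 5, tolerance: float = 0.1) -> bool:
    """True when each of the last `window` timings is within `tolerance` of their mean."""
    if len(timings) < window:
        return False
    recent = timings[-window:]
    mean = sum(recent) / window
    return all(abs(t - mean) <= tolerance * mean for t in recent)


def fake_synthesize(text: str):
    """Hypothetical stand-in that yields one dummy chunk per call."""
    yield text.encode("utf-8")


timings = warm_up(fake_synthesize, ["short text", "a somewhat longer warmup text"], runs=6)
print(len(timings), "warmup runs recorded")
```

Checking `is_stable(timings)` before putting a service into rotation gives a rough signal that latency has settled, though the right window and tolerance depend on the workload.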

Do you have any good tips for how you warm up yourselves? 50 different texts? Of what length?

Sep 06 '24 08:09 EvilCalf

We have updated the code to set load_onnx=False. It seems ONNX is not very stable, in terms of both RTF and GPU memory.

Sep 06 '24 09:09 aluminumbox

It feels noticeably better now 😭. One more question: can it take streaming input directly?

Sep 06 '24 10:09 EvilCalf

Just input the sentences sequentially.

Sep 06 '24 10:09 aluminumbox
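"Input the sentences sequentially" can be sketched as buffering incoming text fragments (for example from an upstream text stream), cutting them at sentence-ending punctuation, and handing each complete sentence to the synthesis call in order. The names below, including the `synthesize` callable and the punctuation set, are assumptions for illustration only.

```python
from typing import Callable, Iterable, Iterator

# Punctuation treated as a sentence boundary. Note that "." will also split
# decimals such as "3.14"; a real splitter would need to handle that case.
SENTENCE_END = "。！？.!?"


def sentences_from_stream(fragments: Iterable[str]) -> Iterator[str]:
    """Group incoming text fragments into complete sentences."""
    buffer = ""
    for fragment in fragments:
        for char in fragment:
            buffer += char
            if char in SENTENCE_END:
                if buffer.strip():
                    yield buffer.strip()
                buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


def speak_stream(fragments: Iterable[str], synthesize: Callable[[str], Iterable]) -> Iterator:
    """Feed each complete sentence to `synthesize` in order and relay its chunks."""
    for sentence in sentences_from_stream(fragments):
        yield from synthesize(sentence)


for sentence in sentences_from_stream(["你好，世", "界。今天", "天气不错！再见"]):
    print(sentence)
```

Because synthesis starts as soon as the first sentence is complete, latency to first audio depends on sentence length rather than on the length of the whole input.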

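Hello, may I ask: on an A800, does this sample text also take about 2 seconds to finish? I'd like to know the best achievable inference speed, to confirm whether it can be used in scenarios with strict real-time requirements.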

Sep 06 '24 13:09 big-lilijiang

On my M3 Max laptop, I get a speed of about "30s/it" (latest code, latest model on 🤗).

Sep 07 '24 01:09 yybawang

On an A4000, I pulled the latest code (9.6) with load_jit=True, and it feels noticeably slower than the previous version. Why is that? Did I misconfigure something? I see that some of the code merged into the main branch was rolled back, namely the load_trt parts. Is that related? The same input used to take only half the time.

Current output:

tn 这里假设音频是单声道（channels = 1），采样宽度为 2 字节（通常对应 16 位音频），帧率为 44100Hz。在实际应用中，你可能需要根据音频的实际参数进行调整。同时，这种方法使用了临时文件来将 NumPy 数组转换为 .wav 格式，如果需要更高效的方法，可以考虑直接在内存中进行 .wav 文件的构建而不使用临时文件。
to 这里假设音频是单声道（channels 等于一），采样宽度为二字节（通常对应十六位音频），帧率为四万四千一百赫兹。在实际应用中，你可能需要根据音频的实际参数进行调整。同时，这种方法使用了临时文件来将 NumPy 数组转换为 .wav 格式，如果需要更高效的方法，可以考虑直接在内存中进行 .wav 文件的构建而不使用临时文件。
0%| | 0/2 [00:00<?, ?it/s]
2024-09-13 01:37:22,763 INFO synthesis text 这里假设音频是单声道channels等于一，采样宽度为二字节通常对应十六位音频，帧率为四万四千一百赫兹。在实际应用中，你可能需要根据音频的实际参数进行调整。
2024-09-13 01:37:28,782 INFO yield speech len 1.7647165532879818, rtf 3.410812273106881, total cost 6.019117832183838s
2024-09-13 01:37:32,238 INFO yield speech len 1.9969160997732427, rtf 1.7305504500887594, total cost 3.455765724182129s
2024-09-13 01:37:35,646 INFO yield speech len 1.9969160997732427, rtf 1.7062727597455472, total cost 3.4072844982147217s
2024-09-13 01:37:39,002 INFO yield speech len 1.9969160997732427, rtf 1.6804975181341517, total cost 3.355813503265381s
2024-09-13 01:37:42,460 INFO yield speech len 1.9969160997732427, rtf 1.7319099826001843, total cost 3.4584801197052s
2024-09-13 01:37:45,500 INFO yield speech len 1.9969160997732427, rtf 1.5222031374050433, total cost 3.039712905883789s
2024-09-13 01:37:47,781 INFO yield speech len 1.9969160997732427, rtf 1.1422677065765614, total cost 2.2810139656066895s
2024-09-13 01:37:50,112 INFO yield speech len 1.9969160997732427, rtf 1.16714964685745, total cost 2.330700635910034s
2024-09-13 01:37:52,416 INFO yield speech len 1.486077097505669, rtf 1.5497804244660074, total cost 2.3030943870544434s
50%|█████▌ | 1/2 [00:30<00:30, 30.12s/it]
2024-09-13 01:37:52,521 INFO synthesis text 同时，这种方法使用了临时文件来将NumPy数组转换为、wav格式，如果需要更高效的方法，可以考虑直接在内存中进行、wav文件的构建而不使用临时文件。
2024-09-13 01:37:57,964 INFO yield speech len 1.7647165532879818, rtf 3.084643530708395, total cost 5.443522691726685s
2024-09-13 01:38:01,451 INFO yield speech len 1.9969160997732427, rtf 1.7460250271332645, total cost 3.4866673946380615s
2024-09-13 01:38:05,087 INFO yield speech len 1.9969160997732427, rtf 1.820409257062386, total cost 3.6352062225341797s
2024-09-13 01:38:08,681 INFO yield speech len 1.9969160997732427, rtf 1.800061997058693, total cost 3.594573974609375s
2024-09-13 01:38:12,367 INFO yield speech len 1.9969160997732427, rtf 1.845703701499503, total cost 3.6857166290283203s
2024-09-13 01:38:14,757 INFO yield speech len 1.9969160997732427, rtf 1.196570088003957, total cost 2.389451503753662s
2024-09-13 01:38:17,185 INFO yield speech len 1.9969160997732427, rtf 1.2156676574133682, total cost 2.4275879859924316s
2024-09-13 01:38:19,820 INFO yield speech len 1.8692063492063493, rtf 1.4096638106781503, total cost 2.634953737258911s
100%|██████████| 2/2 [00:57<00:00, 28.76s/it]
infer time: 70.34108649999999

Previous output:

tn 这里假设音频是单声道（channels = 1），采样宽度为 2 字节（通常对应 16 位音频），帧率为 44100Hz。在实际应用中，你可能需要根据音频的实际参数进行调整。同时，这种方法使用了临时文件来将 NumPy 数组转换为 .wav 格式，如果需要更高效的方法，可以考虑直接在内存中进行 .wav 文件的构建而不使用临时文件。
to 这里假设音频是单声道（channels 等于一），采样宽度为二字节（通常对应十六位音频），帧率为四万四千一百赫兹。在实际应用中，你可能需要根据音频的实际参数进行调整。同时，这种方法使用了临时文件来将 NumPy 数组转换为 .wav 格式，如果需要更高效的方法，可以考虑直接在内存中进行 .wav 文件的构建而不使用临时文件。
分割结果: ['这里假设音频是单声道channels等于一，采样宽度为二字节通常对应十六位音频，帧率为四万四千一百赫兹。在实际应用中，你可能需要根据音频的实际参数进行调整。', '同时，这种方法使用了临时文件来将NumPy数组转换为、wav格式，如果需要更高效的方法，可以考虑直接在内存中进行、wav文件的构建而不使用临时文件。']
llm cost: 15.500642776489258 s
flow cost: 3.4648516178131104 s
hifigan cost: 0.1598224639892578 s
llm cost: 13.476516723632812 s
flow cost: 3.245485544204712 s
hifigan cost: 0.14189839363098145 s
infer time: 33.4950243

Sep 13 '24 01:09 LzyloveRila
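To compare runs like the ones logged above, the per-chunk entries can be aggregated into a single figure. The sketch below parses the "yield speech len ..., rtf ..., total cost ...s" entries and computes total cost divided by total speech length; in these logs the logged per-chunk rtf closely matches that same ratio. The function names and the regular expression are assumptions written for this log format only.

```python
import re
from typing import List, Tuple

# Matches entries such as:
# "... INFO yield speech len 1.76, rtf 3.41, total cost 6.01s"
CHUNK_LINE = re.compile(
    r"yield speech len\s+(?P<len>[\d.]+),\s+rtf\s+(?P<rtf>[\d.]+),"
    r"\s+total cost\s+(?P<cost>[\d.]+)s"
)


def parse_chunks(log_text: str) -> List[Tuple[float, float]]:
    """Return (speech_seconds, cost_seconds) for every 'yield speech' entry."""
    return [(float(m["len"]), float(m["cost"])) for m in CHUNK_LINE.finditer(log_text)]


def overall_rtf(chunks: List[Tuple[float, float]]) -> float:
    """Total compute time divided by total audio duration across all chunks."""
    speech = sum(s for s, _ in chunks)
    cost = sum(c for _, c in chunks)
    if speech <= 0:
        raise ValueError("no speech found in log")
    return cost / speech


# First two chunk entries copied from the log above.
sample = (
    "2024-09-13 01:37:28,782 INFO yield speech len 1.7647165532879818, "
    "rtf 3.410812273106881, total cost 6.019117832183838s\n"
    "2024-09-13 01:37:32,238 INFO yield speech len 1.9969160997732427, "
    "rtf 1.7305504500887594, total cost 3.455765724182129s\n"
)
chunks = parse_chunks(sample)
print(f"{len(chunks)} chunks, overall rtf {overall_rtf(chunks):.3f}")
```

Feeding the full log through `parse_chunks` makes it easy to see that the first chunk of each sentence is the slowest, which matches the first-chunk observation reported later in the thread.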

I have the same issue: with load_jit=True, the time taken by the first chunk is very high.

Sep 13 '24 02:09 shanhaidexiamo

@shanhaidexiamo Hi, have you solved this problem?

Sep 19 '24 13:09 wang-TJ-20

@aluminumbox Hi, may I ask why the final chunk has a different length even when using the same input?

Sep 25 '24 09:09 huskyachao

Has this been solved? I ran into:

Traceback (most recent call last):
  File "server.py", line 18, in <module>
    import cosyvoice_pb2
ModuleNotFoundError: No module named 'cosyvoice_pb2'

Dec 27 '24 09:12 520jefferson

I ran into this problem too. Has anyone solved it?

Dec 28 '24 03:12 WendongGan

@WendongGan @520jefferson Run this command under runtime/python/grpc to generate the files gRPC needs: python -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. cosyvoice.proto

Jan 05 '25 08:01 wang-TJ-20
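For setups that prefer to script this step, the same command can be wrapped in a small Python helper. This is a sketch under assumptions: the function names are hypothetical, the flags are copied verbatim from the command above, and it requires the grpcio-tools package to be installed in the active environment. Running it should produce cosyvoice_pb2.py and cosyvoice_pb2_grpc.py next to the .proto file.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_protoc_command(proto_file: str = "cosyvoice.proto") -> List[str]:
    """Build the grpc_tools.protoc invocation quoted in the thread."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        "-I=.", "--python_out=.", "--grpc_python_out=.",
        proto_file,
    ]


def generate_stubs(grpc_dir: str) -> None:
    """Run protoc inside `grpc_dir`, the directory that holds cosyvoice.proto."""
    proto = Path(grpc_dir) / "cosyvoice.proto"
    if not proto.is_file():
        raise FileNotFoundError(f"{proto} not found")
    # cwd is set so the relative "." paths resolve exactly as in the manual command.
    subprocess.run(build_protoc_command(proto.name), cwd=grpc_dir, check=True)


print(" ".join(build_protoc_command()[1:]))
```

A call such as `generate_stubs("runtime/python/grpc")` from the repository root would mirror the manual step; the generated modules then need to be importable from wherever server.py runs.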

That command solved it for me.

May 14 '25 11:05 ops120