Same code, same text-to-speech input: 2-3 s on a 4090, but around 10 s on both an A100 and a 3090, each with exclusive use of the GPU.
Has anyone else run into the same problem?
4090 GPU info: NVIDIA-SMI 535.230.02, Driver Version: 535.230.02, CUDA Version: 12.2
A100 GPU info: NVIDIA-SMI 535.129.03, Driver Version: 535.129.03, CUDA Version: 12.2
3090 GPU info: NVIDIA-SMI 525.147.05, Driver Version: 525.147.05, CUDA Version: 12.0
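To make the numbers comparable across machines, it may help to time the same call the same way everywhere. A minimal stdlib-only sketch (the `synthesize` function named in the usage comment is hypothetical, standing in for whatever TTS entry point is being benchmarked):

```python
import time

def time_call(fn, *args, warmup=1, runs=5):
    """Time fn over several runs after a warm-up pass,
    returning the best wall-clock seconds observed."""
    for _ in range(warmup):
        fn(*args)  # warm-up: exclude model load / CUDA context init
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        # If fn launches CUDA kernels asynchronously (e.g. via torch),
        # call torch.cuda.synchronize() here before reading the clock,
        # otherwise the measurement can be misleadingly short.
        best = min(best, time.perf_counter() - t0)
    return best

# Usage (hypothetical entry point):
# best = time_call(synthesize, "hello world")
# print(f"best of 5: {best:.2f}s")
```

Taking the best of several runs after a warm-up filters out one-time costs (model load, cuDNN autotuning) that can otherwise dominate the first call.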
I've hit the same problem. Has anyone solved it? Latency also gets fairly high under concurrency.
Same here.
I wonder if the cause could be onnxruntime. However I install it, I get these warnings when loading the model:
2025-08-08 15:33:24.150494459 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 10 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-08-08 15:33:24.152160184 [W:onnxruntime:, session_state.cc:1280 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-08-08 15:33:24.152175048 [W:onnxruntime:, session_state.cc:1282 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Has anyone patched this? I traced it to resource contention in the flow inference step (pseudo-concurrency). On a 4090 it's tolerable, but on an A100 the problem is obvious.
Ran into the same problem: a 3090 is fast, but an H20 is quite slow. Any solution?
Has anyone solved this? I hit it too, except in my case the 3090 is fast and the 4090 is slow, by more than 7x. I tried pinning the same torch version and the same onnxruntime version on both machines; neither helped.
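Since version mismatches keep coming up in this thread, it might help if everyone posts the same environment report so the fast and slow machines can be diffed directly. A stdlib-safe sketch (torch/onnxruntime lookups are guarded, so it runs even where one of them is missing):

```python
import platform
import sys

def env_report():
    """Collect version info useful for comparing fast vs slow machines."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch
        info["torch"] = torch.__version__
        info["torch_cuda"] = str(torch.version.cuda)       # CUDA torch was built with
        info["cudnn"] = str(torch.backends.cudnn.version())
        info["gpu"] = (torch.cuda.get_device_name(0)
                       if torch.cuda.is_available() else "none")
    except ImportError:
        info["torch"] = "not installed"
    try:
        import onnxruntime as ort
        info["onnxruntime"] = ort.__version__
        info["ort_providers"] = ",".join(ort.get_available_providers())
    except ImportError:
        info["onnxruntime"] = "not installed"
    return info

# for k, v in env_report().items():
#     print(f"{k}: {v}")
```

Note that the CUDA version `nvidia-smi` prints is only the driver's maximum supported version; the versions torch and onnxruntime were built against (reported above) are what actually matter for kernel selection and can differ between the machines even with identical drivers.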