
RTX 5090 D GPU Compatibility Issue: CUDA Error Causes Service Crash

Open JerryXMA opened this issue 9 months ago • 11 comments


Problem Description

While trying to install and run HeyGem.ai on Windows 11, I ran into a serious compatibility problem. All three services (heygem-tts, heygem-asr, and heygem-f2f) crash after startup and cannot run normally, so the expected endpoints (http://127.0.0.1:18180, http://127.0.0.1:10095, and http://127.0.0.1:8383) are unreachable.

Error Messages

The logs of all three services show the following CUDA error:

CUDA error: no kernel image is available for execution on the device

The specific errors are as follows. heygem-tts log:

RuntimeError: CUDA error: no kernel image is available for execution on the device CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

heygem-asr log:

E20250316 21:18:42.496300 116 paraformer-torch.cpp:62] Error when load am model: ... CUDA error: no kernel image is available for execution on the device

heygem-f2f log:

/usr/local/python3/lib/python3.8/site-packages/torch/cuda/__init__.py:218: UserWarning: NVIDIA GeForce RTX 5090 D with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_37 sm_90. RuntimeError: CUDA error: no kernel image is available for execution on the device

Environment Details

- OS: Windows 11
- GPU: NVIDIA GeForce RTX 5090 D (CUDA capability sm_120)
- Driver version: 572.70 (supports CUDA 12.8)
- Docker version: Docker Desktop 4.39.0

- HeyGem.ai images:
  - guiji2025/fish-speech-ziming (heygem-tts)
  - guiji2025/fun-asr (heygem-asr)
  - guiji2025/heygem.ai (heygem-f2f)
- PyTorch version: 1.12.0+cu113 inside the images (inferred from the logs)
- CUDA version: 12.1.1 inside the images (inferred from the logs)

Problem Analysis

The root cause is that the RTX 5090 D's CUDA capability (sm_120) is not supported by the PyTorch version inside the images. According to the logs, the current PyTorch build only supports sm_50 through sm_90, while sm_120 is a new capability of the Blackwell architecture (RTX 50 series) and requires PyTorch 2.6 or a nightly build.

Attempted Fixes

1. Verified permissions and paths: made sure D:\heygem_data\voice\data and D:\heygem_data\face2face exist and that the Users group has "Full Control" permission.

2. Updated the docker-compose.yml paths: changed the volume paths from D:\heygem_data... to the WSL format /mnt/d/heygem_data/..., which resolved the invalid volume specification error.

3. Enabled debug mode: added the environment variable CUDA_LAUNCH_BLOCKING=1, but the error persists.

4. Checked PyTorch support: found that PyTorch 2.6 and the nightly builds support sm_120, but the PyTorch version inside the images is too old (1.12.0+cu113); a small verification snippet follows this list.

5. Searched the official GitHub issues: found no related discussion.
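For reference, here is a minimal diagnostic snippet (my own, not part of the images) that can be run inside each container to confirm the mismatch between the compiled arch list and the GPU's capability:

```python
import torch

# Compare what this PyTorch build was compiled for with what the GPU reports.
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("Compiled arch list:", torch.cuda.get_arch_list())  # e.g. ['sm_50', ..., 'sm_90']
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Device capability: sm_{major}{minor}")  # the RTX 5090 D reports sm_120
```

If the device capability printed on the last line is missing from the compiled arch list, every kernel launch fails with exactly the "no kernel image is available" error shown above.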

Request

Since the PyTorch version inside the HeyGem.ai images does not support RTX 50-series GPUs, could the maintainers provide updated images with a PyTorch build that supports sm_120 (e.g., PyTorch 2.6 or a nightly build)? Alternatively, could another workaround be provided (e.g., a CPU-mode configuration)? A rough sketch of what such a fallback could look like is below.
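For illustration only, a CPU fallback in the service code might look like the following; `pick_device` is a hypothetical helper, not something the images currently contain:

```python
import torch

def pick_device() -> torch.device:
    """Hypothetical fallback: use the GPU only when its compute capability
    is in this PyTorch build's compiled arch list; otherwise use the CPU."""
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        if f"sm_{major}{minor}" in torch.cuda.get_arch_list():
            return torch.device("cuda")
    return torch.device("cpu")  # sm_120 is absent from the 1.12.0+cu113 arch list
```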

JerryXMA avatar Mar 17 '25 03:03 JerryXMA

One question here: how do we maintain cache consistency between the gateway and the inference engine? We may evict some cache entries once capacity is full, right?

kerthcet avatar Mar 07 '25 07:03 kerthcet

@kerthcet Currently, the gateway cache and the inference cache are two separate cache systems, so they can get out of sync. We have considered synchronizing the engine and the gateway, but the task is rather extensive: it would require significant additional effort to expose the relevant information from the engine, and that functionality is not yet supported.

One aspect of our design is that caching remains a best-effort approach. An incorrect allocation simply triggers generation of fresh cache entries on the pod, which in a way corrects the index. However, this demands more benchmarking and long-term observation, especially when the service runs for an extended period. A rough sketch of the best-effort idea is below.
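To illustrate that best-effort behavior (my reading of the design, not the project's actual implementation), a gateway-side index with capacity-based eviction could look like this:

```python
from collections import OrderedDict
from typing import Optional

class GatewayPrefixIndex:
    """Hypothetical best-effort index mapping prefix hashes to pods.
    A stale or evicted entry only costs a cache miss on the pod, which
    then regenerates the prefix and 'corrects' the mapping over time."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self._index: OrderedDict = OrderedDict()

    def record(self, prefix_hash: int, pod: str) -> None:
        self._index[prefix_hash] = pod
        self._index.move_to_end(prefix_hash)
        if len(self._index) > self.capacity:
            self._index.popitem(last=False)  # evict the least-recently-used entry

    def lookup(self, prefix_hash: int) -> Optional[str]:
        pod = self._index.get(prefix_hash)
        if pod is not None:
            self._index.move_to_end(prefix_hash)
        return pod
```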

Jeffwan avatar Mar 07 '25 08:03 Jeffwan

@Jeffwan @kerthcet Dynamo maintains a consistent view using a kind of pub-sub implementation:

dynamo kv routing
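A minimal sketch of that pub-sub idea (event names and fields are illustrative, not Dynamo's actual protocol): engines publish cache add/evict events and the gateway applies them to its view:

```python
import queue
import threading

# Illustrative event bus: (op, prefix_hash, pod) tuples published by engines.
events: queue.Queue = queue.Queue()

def engine_publish(op: str, prefix_hash: int, pod: str) -> None:
    events.put((op, prefix_hash, pod))

def gateway_consumer(view: dict) -> None:
    # Apply engine events so the gateway's view tracks the engine caches.
    while True:
        op, prefix_hash, pod = events.get()
        if op == "add":
            view[prefix_hash] = pod
        elif op == "evict":
            view.pop(prefix_hash, None)

view: dict = {}
threading.Thread(target=gateway_consumer, args=(view,), daemon=True).start()
engine_publish("add", 0xABC123, "pod-0")  # the gateway's view converges to engine state
```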

gangmuk avatar Mar 19 '25 18:03 gangmuk

To summarize the issue, there are two generalizable aspects of prefix-cache routing: 1) matching and 2) load balancing.

  1. For matching, there are two implementations, hash-based and tree-based. If someone needs to experiment with a different prefix-match algorithm in the future, they can either build on one of these or add a new data structure; a minimal hash-based sketch follows this list.
  2. For load balancing, the two implemented approaches use different load-balancing algorithms, which are embedded in their prefix-match implementations. If a third prefix-match algorithm ever needs to be experimented with, the load-balancing part can be refactored out, or a new one can be built as needed.
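For reference, a minimal sketch of the hash-based flavor (block size and hashing scheme are illustrative, not the project's exact implementation):

```python
import hashlib

BLOCK_SIZE = 16  # tokens per hashed block; illustrative value

def prefix_block_hashes(token_ids: list) -> list:
    """Chain-hash each full block so every hash identifies a cumulative prefix."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prev = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

def match(token_ids: list, index: dict) -> set:
    """Return the pods holding the longest cached prefix; the load-balancing
    step then picks one pod from this candidate set."""
    best: set = set()
    for h in prefix_block_hashes(token_ids):
        pods = index.get(h)
        if not pods:
            break
        best = pods
    return best
```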

For now, I will close this issue as completed.

varungup90 avatar Apr 08 '25 01:04 varungup90