
Logic bug in CUDA memory interception

Open coldzerofear opened this issue 9 months ago • 1 comment

When deploying LLaMA-Factory in a container, the CUDA_VISIBLE_DEVICES environment variable is used to choose which GPUs to run on. If the container is allocated 2 GPUs and CUDA_VISIBLE_DEVICES is manually set to a non-zero device, the memory usage reported by nvidia-smi is always attributed to GPU 0, when it should actually appear on GPU 1.

root@chenweiyi-ed43f-0:/mnt/chenweiyi/LLaMA-Factory# nvidia-smi 
Fri May 24 17:59:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:1A:00.0 Off |                  Off |
| N/A   83C    P0             146W / 400W |   6858MiB /  8192MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:88:00.0 Off |                    0 |
| N/A   86C    P0             235W / 400W |      0MiB /  8192MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

From the host, it can be seen that the workload is actually running on GPU 1.

My analysis: CUDA uses the CUDA_VISIBLE_DEVICES environment variable to determine the logical device order 0, 1, 2, 3, ..., while the NVML library ignores this variable when queried. HAMi's interception library determines the current device ordinal via cuCtxGetDevice(&dev);, so in this scenario the ordinal is offset relative to the physical order NVML uses.

Should the device order instead be determined by device UUID in this case?

coldzerofear avatar May 24 '24 10:05 coldzerofear