Using gpu-manager with CUDA driver 11.6: "Function Not Found" in Memory-Usage when running nvidia-smi in a container
Problem
When using gpu-manager with CUDA 11.6, is function hijacking missing for the memory query interface?
Description:
FB Memory Usage
    Total    : Function Not Found
    Reserved : Function Not Found
    Used     : Function Not Found
    Free     : Function Not Found
Running the nvidia-smi command in the container displays the following:
xx:/# kubectl exec vcuda nvidia-smi
Mon May 30 03:24:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 Off | 00000000:1A:00.0 Off | 0 |
| 0% 50C P0 58W / 150W | Function Not Found | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 648808 C 1367MiB |
+-----------------------------------------------------------------------------+
The current gpu-manager supports CUDA driver 11.5.1. Can anyone add support for CUDA driver 11.6?
Not yet. This issue was circumvented by downgrading the GPU driver to 11.3. Hopefully the community will support this driver version soon.
------------------ Original message ------------------ From: "tkestack/gpu-manager"; Sent: Tuesday, June 28, 2022, 11:33 AM; Subject: Re: [tkestack/gpu-manager] use gpu-manager in cuda drvier11.6 , Function Not Found in Memory-Usage when use nvidia-smi in container (Issue #159)
@WindyLQL Hi, I have the same problem. Did you solve it?
Adding the nvmlDeviceGetMemoryInfo_v2 function to vcuda_controller solves this problem.
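For anyone who wants to see roughly what that means, here is a minimal sketch of such a wrapper. It is not the actual vcuda_controller code. The types are local stand-ins that mirror NVML's public nvmlMemory_v2_t layout so the example compiles without nvml.h. The names g_container_limit_bytes and real_nvmlDeviceGetMemoryInfo_v2 are hypothetical, and the stub returns made-up numbers where a real hook would forward to the driver's libnvidia-ml. The likely cause of the error is that the 510 nvidia-smi queries memory through the _v2 entry point, so a hijack library that only exports the older nvmlDeviceGetMemoryInfo reports "Function Not Found".

```c
#include <stddef.h>

/* Local stand-ins mirroring NVML's public definitions (assumption: a real
 * build would include nvml.h instead of redefining these). */
typedef int nvmlReturn_t;
#define NVML_SUCCESS 0
#define NVML_ERROR_INVALID_ARGUMENT 2
typedef struct nvmlDevice_st *nvmlDevice_t;

typedef struct {
    unsigned int version;
    unsigned long long total;    /* bytes */
    unsigned long long reserved; /* bytes */
    unsigned long long free;     /* bytes */
    unsigned long long used;     /* bytes */
} nvmlMemory_v2_t;

/* Hypothetical per-container memory limit (4 GiB here). */
static unsigned long long g_container_limit_bytes = 4ULL << 30;

/* Hypothetical stand-in for the call into the real driver library.
 * The values are invented for illustration only. */
static nvmlReturn_t real_nvmlDeviceGetMemoryInfo_v2(nvmlDevice_t device,
                                                    nvmlMemory_v2_t *memory) {
    (void)device;
    memory->total = 24ULL << 30;
    memory->reserved = 512ULL << 20;
    memory->used = 1367ULL << 20;
    memory->free = memory->total - memory->reserved - memory->used;
    return NVML_SUCCESS;
}

/* Exported hook: same name and signature as the NVML entry point, so a
 * caller such as nvidia-smi can resolve it in the hijack library. */
nvmlReturn_t nvmlDeviceGetMemoryInfo_v2(nvmlDevice_t device,
                                        nvmlMemory_v2_t *memory) {
    if (memory == NULL) {
        return NVML_ERROR_INVALID_ARGUMENT;
    }

    nvmlReturn_t ret = real_nvmlDeviceGetMemoryInfo_v2(device, memory);
    if (ret != NVML_SUCCESS) {
        return ret;
    }

    /* Clamp the reported totals to the container's share of the GPU. */
    unsigned long long limit = g_container_limit_bytes;
    if (memory->total > limit) {
        memory->total = limit;
        memory->reserved = 0;
        if (memory->used > limit) {
            memory->used = limit;
        }
        memory->free = limit - memory->used;
    }
    return NVML_SUCCESS;
}
```

In this sketch the "used" value is simply whatever the driver reports for the whole device. A real controller would presumably report only the memory used by the container's own processes.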