gpu-operator
Not able to obtain metrics for pods on a GPU node using DCGM Exporter. nv-hostengine debug logs give Error: Could not load NSCQ.
I am trying to obtain per-process GPU metrics using DCGM Exporter. These are the logs from nv-hostengine:
root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3647] [DcgmHostEngineHandler::Init]
bash: 2021-11-23: command not found
root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
When I run nvidia-smi at the node level, I see all the processes with their PIDs and GPU memory usage, but from within the exporter pod all I see is:
[Image: Screenshot 2023-03-13 at 2 05 25 PM]
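To make the node-versus-pod comparison easier to reproduce, here is a minimal sketch that lists per-process GPU memory through nvidia-smi's CSV query mode. It assumes `nvidia-smi` is on the PATH in both environments. The helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical and not part of DCGM or the GPU Operator.

```python
import csv
import io
import subprocess

# nvidia-smi query for compute processes, in CSV form without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' CSV rows into a list of dicts.

    The memory field is kept as a string because nvidia-smi may report
    a placeholder instead of a number in some environments.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, mem = (f.strip() for f in fields)
        if not pid.isdigit():
            continue  # skip rows without a numeric PID
        rows.append({"pid": int(pid), "name": name, "used_memory_mib": mem})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the parsed per-process rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Usage (hypothetical): run once on the node and once inside the exporter pod,
# then compare the two lists:
#     for proc in list_gpu_processes():
#         print(proc)
```

An empty list inside the pod alongside a populated list on the node would match what I am seeing.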
I have tried the solutions suggested in the following issues: https://github.com/NVIDIA/dcgm-exporter/issues/27 and https://github.com/NVIDIA/gpu-operator/issues/294