
Not able to obtain metrics for pods in GPU node using DCGM Exporter. nv-hostengine debug logs give Error: Could not load NSCQ.

Open suchisur opened this issue 1 year ago • 4 comments

I am trying to obtain per-process GPU metrics using DCGM-exporter. These are the logs from nv-hostengine:

root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3647] [DcgmHostEngineHandler::Init]
bash: 2021-11-23: command not found
root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
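For triage, the log lines above can be split into their fields so the failing function and source location stand out. The following is a minimal Python sketch, not part of DCGM or dcgm-exporter. The field layout is an assumption inferred only from the two lines quoted here, and the names `LOG_LINE`, `parse_hostengine_line`, and `SAMPLE` are hypothetical.

```python
import re
from typing import Optional

# Assumed layout, inferred only from the two log lines quoted in this issue:
#   <date> <time> <LEVEL> [<ids>] <message> [<source file>:<line>] [<function>]
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<ids>\d+:\d+)\] "
    r"(?P<message>.*?) "
    r"\[(?P<file>[^\]\s]+):(?P<line>\d+)\] "
    r"\[(?P<function>[^\]]+)\]\s*$"
)


def parse_hostengine_line(text: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    # search() rather than match(), so a leading shell prompt is skipped.
    found = LOG_LINE.search(text)
    return found.groupdict() if found else None


# The three lines pasted above, including the stray bash error.
SAMPLE = [
    "root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] "
    "Cannot initialize the hostengine: Error: Failed to initialize NVML "
    "[/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3647] "
    "[DcgmHostEngineHandler::Init]",
    "bash: 2021-11-23: command not found",
    "root@dcgm-exporter-tlb4f:/# 2021-11-23 00:15:28.951 ERROR [82:82] "
    "DcgmHostEngineHandler::Init failed "
    "[/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] "
    "[dcgmStartEmbedded_v2]",
]

if __name__ == "__main__":
    for raw in SAMPLE:
        fields = parse_hostengine_line(raw)
        if fields:  # lines that are not hostengine log entries are skipped
            print(fields["level"], fields["function"], "->", fields["message"])
```

Read this way, both entries point at the same failure: the hostengine could not initialize NVML inside the exporter pod.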

When I run nvidia-smi on the node, I see all the processes with their PIDs and GPU memory usage, but from within the exporter pod all I see is:

[Screenshot 2023-03-13 at 2 05 25 PM]

I tried the solutions suggested in the following issues: https://github.com/NVIDIA/dcgm-exporter/issues/27 and https://github.com/NVIDIA/gpu-operator/issues/294

suchisur, Mar 13 '23 08:03