DCGM exporter failing due to NVIDIA GPU Operator initialization problem

Open jaipreetnagpal opened this issue 8 months ago • 2 comments

We are facing an issue using the DCGM exporter integration from Datadog together with the NVIDIA GPU Operator to monitor GPU resources through Datadog. We currently run the NVIDIA GPU Operator on the ROSA (Red Hat OpenShift Service on AWS) platform. Both Datadog and our team have observed the errors below, which point to a problem with the NVIDIA configuration:

In the NVIDIA hostengine logs, we can see multiple errors during start-up, which indicate an initialization problem:

ERROR [1:11] Cannot load NVML; DCGM will proceed without managing GPUs. [/builds/dcgm/dcgm/dcgmlib/src/DcgmHostEngineHandler.cpp:1545] [DcgmHostEngineHandler::LoadNvml]

ERROR [1:11] [[NvSwitch]] Could not load NVSDM [/builds/dcgm/dcgm/modules/nvswitch/DcgmNvsdmManager.cpp:621] [DcgmNs::DcgmNvsdmManager::AttachToNvsdm]

ERROR [1:11] [[NvSwitch]] AttachToNvsdm() returned -25 [/builds/dcgm/dcgm/modules/nvswitch/DcgmNvsdmManager.cpp:587] [DcgmNs::DcgmNvsdmManager::Init]

ERROR [1:11] [[NvSwitch]] Could not load NSCQ. dlwrap_attach ret: Can not access a needed shared library (-79): If this system has NvSwitches, please ensure that the package libnvidia-nscq is installed on your system and that the service user has permissions to access it. [/builds/dcgm/dcgm/modules/nvswitch/DcgmNscqManager.cpp:502] [DcgmNs::DcgmNscqManager::AttachToNscq]
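The first error ("Cannot load NVML") means the hostengine could not load the NVML shared library. One way to narrow this down is to try loading the library from inside a pod that has the same GPU driver mounts as the exporter. The following is a minimal sketch, not an official diagnostic: it assumes Python is available in that pod and that NVML is exposed under the soname `libnvidia-ml.so.1`. The helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(libname):
    """Try to load a shared library by name; return (ok, detail)."""
    try:
        ctypes.CDLL(libname)
        return True, "loaded"
    except OSError as exc:
        # Loading failed: the library is missing, not on the loader
        # search path, or not readable by the current user.
        return False, str(exc)


if __name__ == "__main__":
    # Assumption: NVML is exposed under this soname inside the container.
    name = "libnvidia-ml.so.1"
    ok, detail = can_load(name)
    print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

If the load fails here as well, the driver libraries are probably not visible inside the container, which would match the hostengine error. The NSCQ message itself is conditional ("If this system has NvSwitches"), so it may be unrelated to the missing GPU metrics.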

We also reached out to the Datadog team about the issue. They checked and confirmed that the configuration is correct, and asked us to contact the NVIDIA team. We tried, but because we do not have an NVIDIA enterprise account, they are not able to support us on this.

jaipreetnagpal, Apr 28 '25 08:04