
Log spam of "[[NvSwitch]] Not attached to NvSwitches. Aborting" in cuda-dcgm-3.1.3.1 via Bright Cluster, RHEL 8

Open · LinuxPersonEC opened this issue 1 year ago · 8 comments

Using:

cuda-dcgm-libs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-nvvs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-3.1.3.1-198_cm9.2.x86_64

The 'cm' stands for "Cluster Manager" as in Nvidia Bright Computing (now called Base Command).

The /var/log/nv-hostengine.log is filling up with entries like this every few seconds:

2024-04-02 12:54:06.828 ERROR [1264985:1264994] [[NvSwitch]] Not attached to NvSwitches. Aborting [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:967] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]

In /etc/dcgm.env we have: __DCGM_DBG_LVL=NONE

That setting seems to have quieted these logs:

ERROR [5450:5462] Got more than DCGM_MAX_CLOCKS supported clocks. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:11130] [DcgmCacheManager::AppendDeviceSupportedClocks]

These are the same errors from this DCGM Exporter bug.
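For reference, the setting lives in /etc/dcgm.env and nv-hostengine has to be restarted to pick it up; roughly like this (the service unit name is an assumption, since Bright may launch nv-hostengine differently):

# /etc/dcgm.env
__DCGM_DBG_LVL=NONE

# restart the host engine so it rereads the env file
# (unit name is a guess -- under Bright, nv-hostengine may be started another way)
systemctl restart nvidia-dcgm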

LinuxPersonEC · Apr 02 '24 18:04

@LinuxPersonEC,

Can you confirm if the system has NvSwitches and if the correct version of the libnvidia-nscq package is installed?
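For example, something along these lines should show both (a rough sketch only; adjust for your environment):

lspci -d 10de: | grep -i bridge    # NVSwitch devices typically enumerate as NVIDIA bridge devices on the PCI bus
nvidia-smi topo -m                 # on an NVSwitch system the GPU-to-GPU links show up as NVx
rpm -qa | grep -i nscq             # is any libnvidia-nscq package installed?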

nikkon-dev · Apr 03 '24 20:04

Well, we load this as a module that NVIDIA Bright Computing supplies. All I see is:

find /cm/local/apps/cuda-dcgm -name '*vswitch*'
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3.1.3

And no sign of libnvidia-nscq.
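A couple of quick ways to double-check that (just a rough sketch; exact package names may differ):

rpm -qa 'libnvidia-nscq*'      # any nscq RPM installed at all?
ldconfig -p | grep -i nscq     # is a libnvidia-nscq shared library visible to the dynamic loader?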

LinuxPersonEC · Apr 04 '24 01:04

@LinuxPersonEC,

libnvidia-nscq is not part of DCGM itself; it's a library required for NVSwitch / Fabric Manager to work correctly. You can find a suitable package, for example, here: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

I'm asking because on a system without NVSwitches, those logs should not be written more than once. But if DCGM detects NVSwitches in the system, it tries to enumerate them again and again; without a proper nscq library (whose version must exactly match the installed driver), initialization fails and the error log keeps growing.
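As a quick sanity check (commands are only a sketch), you can compare the driver version against whatever nscq package is installed:

nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
rpm -qa 'libnvidia-nscq*'    # the nscq package version should line up with the driver version above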

nikkon-dev · Apr 04 '24 02:04

> libnvidia-nscq is not part of DCGM itself; it's a library required for NVSwitch / Fabric Manager to work correctly. You can find a suitable package, for example, here: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

We're on RHEL, so I see this yum packaging page.

> But if DCGM detects NVSwitches in the system, it tries to enumerate them again and again; without a proper nscq library (whose version must exactly match the installed driver), initialization fails and the error log keeps growing.

We are on driver version 530.30.02 with CUDA 12.1 (per nvidia-smi). I don't see a 530.30.02 version in the tar archives. Or is there a different driver version you are referring to?
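For what it's worth, querying the repo for nscq packages would presumably look something like this (assuming the NVIDIA CUDA repo for RHEL is already configured on the node):

dnf list --available 'libnvidia-nscq*'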

LinuxPersonEC · Apr 04 '24 16:04

@LinuxPersonEC

Could you try this? https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm
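If it helps, dnf can install straight from that URL (a sketch; adjust for your local mirror or repo policy):

sudo dnf install https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm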

nikkon-dev · Apr 04 '24 18:04

Hi, 530.30.02 shipped bundled with CUDA, so the binary archive tarball is here: https://developer.download.nvidia.com/compute/cuda/redist/libnvidia_nscq/linux-x86_64/
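Roughly, pulling and unpacking from that listing would look like this (the exact archive filename below is an assumption based on the usual redist naming; verify it against the directory listing first):

# archive name is assumed -- check the listing before downloading
curl -LO https://developer.download.nvidia.com/compute/cuda/redist/libnvidia_nscq/linux-x86_64/libnvidia_nscq-linux-x86_64-530.30.02-archive.tar.xz
tar -xf libnvidia_nscq-linux-x86_64-530.30.02-archive.tar.xz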

kmittman · Apr 04 '24 18:04

OK, the RPM worked! What's the proper configuration now that it's installed? Can you point me to some instructions, ideally for RHEL?

LinuxPersonEC · Apr 04 '24 19:04

@LinuxPersonEC,

The instructions are the same. You did not find the tarball initially because the compute/nvidia-driver location only has TRD drivers, and your installed driver is a developer driver that ships only with the CUDA SDK (so you need to get tarballs from compute/cuda instead). I will see whether the documentation should be updated to be clearer about this.

nikkon-dev · Apr 04 '24 20:04