Nik Konyuchenko

Results 96 comments of Nik Konyuchenko

@xwhuang0923, please take a look at the `dcgmi discovery -c` output. In the `--device=i:X` argument, `X` is the entity ID from the discovery command output, not the MIG Dev...

@SomePersonSomeWhereInTheWorld, can you confirm whether the system has NVSwitches and whether the correct version of the libnvidia-nscq package is installed?

@SomePersonSomeWhereInTheWorld, libnvidia-nscq is not part of DCGM; it is a library required for NVSwitch / Fabric Manager to work correctly. You could find a proper package, for example,...

@SomePersonSomeWhereInTheWorld Could you try this? https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm

@SomePersonSomeWhereInTheWorld, The instructions are the same. You did not find the tarball initially because the compute/nvidia-driver location only has TRD drivers, and your installed driver is a developer driver that...

From the logs I see that the `DCGM_FI_PROF_NVLINK_L0_TX_BYTES (1040)` field was used instead of `DCGM_FI_PROF_NVLINK_TX_BYTES (1011)`: `[[Profiling]] FieldId {1040} is not supported for GPU 0` The DCGM_FI_PROF_NVLINK_L0_TX_BYTES is only supported...
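To spot this kind of mismatch quickly, a small script can scan a log for such messages. The following is a minimal sketch, not part of DCGM: the regular expression is modeled only on the single log line quoted above (an assumption about the general message format), the field names and IDs are the two mentioned in the comment, and `find_unsupported_fields` is a hypothetical helper name.

```python
import re

# Field IDs and names taken from the comment above; extend as needed.
FIELD_NAMES = {
    1011: "DCGM_FI_PROF_NVLINK_TX_BYTES",
    1040: "DCGM_FI_PROF_NVLINK_L0_TX_BYTES",
}

# Assumption: other "not supported" messages follow the same shape
# as the one log line quoted in the comment.
UNSUPPORTED_RE = re.compile(r"FieldId \{(\d+)\} is not supported for GPU (\d+)")


def find_unsupported_fields(log_text):
    """Return (gpu_id, field_id, field_name) for each matching log message."""
    results = []
    for match in UNSUPPORTED_RE.finditer(log_text):
        field_id = int(match.group(1))
        gpu_id = int(match.group(2))
        name = FIELD_NAMES.get(field_id, "unknown")
        results.append((gpu_id, field_id, name))
    return results


sample = "[[Profiling]] FieldId {1040} is not supported for GPU 0"
print(find_unsupported_fields(sample))
```

Field IDs that are not in `FIELD_NAMES` are reported as `"unknown"` rather than guessed.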

I see you are running nv-hostengine on port 5555. Could you rerun it with the `-f host.debug.log --log-level debug` arguments and provide host.debug.log after the dcgm-exporter starts reporting metrics or...

@nguoido, If you enable the debug logs, you should see the following message after those errors: `Plugin does not have a ShutdownPlugin function. This is not an error.`. Those are...

The ShutdownPlugin message is not an error: if you enable the debug logs, you'll see a message stating that the missing function is not an error. The actual issue in your...

I see that the plugin fails to initialize due to the error returned from the `cudaDeviceGetByPCIBusId` function. Do `nvidia-smi` and `nvidia-smi -q` work on the system? That looks like...