DCGM
./dcgmproftester11 returns WatchFields error
Hi, I am running the dcgmproftester11 binary, which is located in "/home/DCGM/_out/Linux-amd64-debug/share/dcgm_tests/apps/amd64". When I execute ./dcgmproftester11, it gives me the error "dcgmWatchFields() returned -33. [/workspaces/DCGM/dcgmproftester/DcgmProfTester.cpp:231] [DcgmProfTester::WatchFields]".
Is this normal behavior?
My working environment, as reported by nvidia-smi:

| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 22%   30C    P8    21W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:40:00.0 Off |                  N/A |
| 22%   29C    P8     7W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The DCGM version is 3.0.4.
@ligeweiwu,
To use dcgmproftester, there are two options:
- Obtain the libdcgmmoduleprofiling.so file from the official DCGM package.
- Use the --no-dcgm-validation flag to generate load without reading metric values back.
Please note that the profiling module that provides DCP (1001-10XX) metrics is not open-sourced.
@nikkon-dev
> Please note that the profiling module that provides DCP (1001-10XX) metrics is not open-sourced.
Can you clarify what this entails?
I installed DCGM on Fedora using the RHEL packages, which also seem to include dcgmproftester12. However, it does not seem to be able to load the profiling module. Should I try building from the repo directly and linking the dcgmmoduleprofiling library installed from the package?
@SamKG,
The DCGM_FI_PROF_* metrics (also known as DCP) are managed by the libdcgmmoduleprofiling.so library, which is not open-source. The dcgmproftester* tool is specifically designed to test these fields.
The open-source part cannot handle validation on its own: you need to either specify --no-dcgm-validation, put libdcgmmoduleprofiling.so next to the dcgmproftester* binary, or set the LD_LIBRARY_PATH environment variable.
The dcgmproftester binary will not search for the libdcgmmoduleprofiling.so library installed system-wide from the CUDA repository. Each dcgm binary has RPATH configured to search for libraries in the $ORIGIN/:$ORIGIN/../lib directories.
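
For anyone checking where the module would need to sit, the lookup rule above can be sketched in Python. This is a simplified model and not the dynamic loader's actual algorithm: it only replaces $ORIGIN with the directory that contains the binary and appends any LD_LIBRARY_PATH entries, and the order of the returned list is not meant to mirror loader precedence. The helper names candidate_dirs and find_module are made up for this illustration.

```python
import os

# RPATH value quoted in the reply above.
DCGM_RPATH = "$ORIGIN/:$ORIGIN/../lib"
MODULE_NAME = "libdcgmmoduleprofiling.so"


def candidate_dirs(binary, rpath=DCGM_RPATH, ld_library_path=None):
    """List directories worth checking for the profiling module.

    Simplified model (assumption): RPATH entries with $ORIGIN replaced by
    the binary's directory, followed by LD_LIBRARY_PATH entries.
    """
    origin = os.path.dirname(os.path.abspath(binary))
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")

    entries = [e.replace("$ORIGIN", origin) for e in rpath.split(":")]
    entries += ld_library_path.split(":")

    dirs = []
    for entry in entries:
        if not entry:
            continue  # skip empty segments such as a trailing ':'
        path = os.path.normpath(entry)
        if path not in dirs:  # drop duplicates, keep first occurrence
            dirs.append(path)
    return dirs


def find_module(binary, **kwargs):
    """Return the first existing path to the module, or None."""
    for directory in candidate_dirs(binary, **kwargs):
        path = os.path.join(directory, MODULE_NAME)
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    # Binary location taken from the original report in this thread.
    binary = ("/home/DCGM/_out/Linux-amd64-debug/share/dcgm_tests/"
              "apps/amd64/dcgmproftester11")
    for directory in candidate_dirs(binary):
        print("would check:", directory)
    print(find_module(binary) or "module not found in these directories")
```

If find_module returns None, the remaining option from this thread is to run with --no-dcgm-validation.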
@nikkon-dev
Thanks!
I tried both methods (copying the .so file and setting LD_LIBRARY_PATH). However, I still get a Failed to load error for the Profiling module when running dcgmi modules -l. dcgmproftester12 also does not work.
Is there any way to debug this?
@SamKG,
You can run nv-hostengine with the log-level debug option to see a more detailed error message. However, based on your nvidia-smi output, it appears that you have a GeForce-class GPU, which is not supported by DCP metrics. It's important to note that dcgmproftester is not a generic stress testing tool; its purpose is to test that DCP metrics report expected values.
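
A quick pre-check for the GeForce case can be sketched in Python. It assumes the GPU names have already been collected as plain text, one per line, for example from nvidia-smi --query-gpu=name --format=csv,noheader. Matching on the product name is only a heuristic chosen for this illustration, and geforce_gpus is a hypothetical helper name; the DCGM support matrix remains the authoritative source.

```python
def geforce_gpus(names_output):
    """Return the GPU names that contain 'GeForce' (case-insensitive).

    names_output: text with one GPU name per line.
    Heuristic only (assumption): the name is used as a stand-in for the
    product class mentioned in the reply above.
    """
    names = [line.strip() for line in names_output.splitlines()]
    return [name for name in names if "geforce" in name.lower()]


if __name__ == "__main__":
    # Names as they appear (truncated) in the nvidia-smi output in this thread.
    sample = "NVIDIA GeForce ...\nNVIDIA GeForce ...\n"
    flagged = geforce_gpus(sample)
    if flagged:
        print(len(flagged), "GeForce-class GPU(s): DCP metrics not expected to work")
```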
@nikkon-dev
Thanks, I'll try. Regarding support for metrics: my understanding of this support matrix (https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html?highlight=geforce) is that GeForce does support GPU metrics. Are these the same as DCP metrics? If so, is there a way to tell which metrics are or are not supported on GeForce?
@nikkon-dev Sorry to bother you. Do you mean that the libdcgmmoduleprofiling.so file is not open source and can only be copied from the NGC mirrors?