
./dcgmproftester11 returns WatchFields error

Open • ligeweiwu opened this issue 1 year ago • 7 comments

Hi, I am running the dcgmproftester11 binary, which is located in "/home/DCGM/_out/Linux-amd64-debug/share/dcgm_tests/apps/amd64". When I execute ./dcgmproftester11, it gives me the error "dcgmWatchFields() returned -33. [/workspaces/DCGM/dcgmproftester/DcgmProfTester.cpp:231] [DcgmProfTester::WatchFields]"

Is this expected behavior?

My working env, from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 22%   30C    P8    21W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:40:00.0 Off |                  N/A |
| 22%   29C    P8     7W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The version of DCGM is 3.0.4.

ligeweiwu • Mar 06 '23 08:03

@ligeweiwu,

To use dcgmproftester, there are two options:

  • Obtain the libdcgmmoduleprofiling.so file from the official DCGM package.
  • Use the --no-dcgm-validation flag to generate load without reading metric values back.

Please note that the profiling module that provides DCP (1001-10XX) metrics is not open-sourced.
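A minimal sketch of the second option, written in Python purely for illustration. Only the --no-dcgm-validation flag comes from this thread; the helper names build_load_command and run_load and the binary path are hypothetical, and any further arguments should be taken from the tool's own help output rather than from this example.

```python
import subprocess


def build_load_command(binary, extra_args=()):
    """Return an argv list that runs dcgmproftester as a load generator only.

    `binary` is the path to a dcgmproftester* executable (hypothetical here).
    `--no-dcgm-validation` is the flag named in this thread: load is generated
    without reading metric values back.
    """
    return [str(binary), "--no-dcgm-validation", *extra_args]


def run_load(binary, extra_args=()):
    """Run the tool and return its exit code (nothing runs at import time)."""
    return subprocess.run(build_load_command(binary, extra_args)).returncode
```

Calling run_load("./dcgmproftester11") would launch the binary from the current directory with validation disabled.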

nikkon-dev • Mar 06 '23 20:03

@nikkon-dev

"Please note that the profiling module that provides DCP (1001-10XX) metrics is not open-sourced."

Can you clarify what this entails? I installed DCGM on Fedora using the RHEL packages, which also seem to include dcgmproftester12. However, it does not seem to be able to load the profiling module. Should I try building from the repo directly and linking the dcgmmoduleprofiling library installed from the package?

SamKG • Apr 10 '23 20:04

@SamKG,

The DCGM_FI_PROF_* metrics (also known as DCP) are managed by the libdcgmmoduleprofiling.so library, which is not open-source. The dcgmproftester* tool is specifically designed to test these fields.

The open-source part cannot handle validation on its own. You need to either specify --no-dcgm-validation, put libdcgmmoduleprofiling.so next to the dcgmproftester* binary, or set the LD_LIBRARY_PATH environment variable to the directory that contains it.

The dcgmproftester binary will not search for the libdcgmmoduleprofiling.so library installed system-wide from the CUDA repository. Each dcgm binary has RPATH configured to search for libraries in the $ORIGIN/:$ORIGIN/../lib directories.
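To check where the library would need to sit, the search described above can be approximated in Python. This is a sketch under stated assumptions, not the loader's actual algorithm: it checks the two RPATH entries quoted above ($ORIGIN is the directory holding the binary) and then any LD_LIBRARY_PATH entries, and it ignores the loader's other rules such as RPATH versus RUNPATH precedence. The function names are hypothetical.

```python
import os
from pathlib import Path

PROFILING_LIB = "libdcgmmoduleprofiling.so"


def candidate_dirs(binary, ld_library_path=None):
    """List directories to check, in a simplified order.

    First the two RPATH entries from the thread ($ORIGIN and $ORIGIN/../lib),
    then the entries of LD_LIBRARY_PATH. If `ld_library_path` is None, the
    value is read from the current environment.
    """
    origin = Path(binary).resolve().parent
    dirs = [origin, (origin / ".." / "lib").resolve()]
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    dirs += [Path(p) for p in ld_library_path.split(":") if p]
    return dirs


def find_profiling_lib(binary, ld_library_path=None):
    """Return the first matching library path, or None if nothing is found."""
    for d in candidate_dirs(binary, ld_library_path):
        # Also accept file names that carry a version suffix after ".so".
        matches = sorted(d.glob(PROFILING_LIB + "*")) if d.is_dir() else []
        if matches:
            return matches[0]
    return None
```

If find_profiling_lib("/path/to/dcgmproftester11") returns None, the library is in none of the checked directories, which matches the case where it is only installed system-wide.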

nikkon-dev • Apr 10 '23 21:04

@nikkon-dev Thanks! I tried both methods (copying the .so file next to the binary, and setting LD_LIBRARY_PATH). However, dcgmi modules -l still reports a "Failed to load" error for the Profiling module, and dcgmproftester12 still does not work.

Is there any way to debug?

SamKG • Apr 10 '23 21:04

@SamKG,

You can run nv-hostengine with the log level set to debug to see a more detailed error message. However, based on the nvidia-smi output in this issue, it appears that you have a GeForce-class GPU, which does not support DCP metrics. It's important to note that dcgmproftester is not a generic stress testing tool; its purpose is to test that DCP metrics report expected values.

nikkon-dev • Apr 10 '23 22:04

@nikkon-dev

Thanks, I'll try. Regarding support for metrics: my understanding of this support matrix (https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html?highlight=geforce) is that GeForce does support GPU metrics. Are these the same as DCP metrics? If so, is there a way to tell which metrics are or are not supported on GeForce?

SamKG • Apr 10 '23 23:04

@nikkon-dev Sorry to bother you. Do you mean that the libdcgmmoduleprofiling.so file is not open source and can only be copied from NGC mirrors?

AltarIbnL • Apr 26 '23 09:04