sm_stress test is missing from dcgm-3.2.3

Open bsteinb opened this issue 2 years ago • 1 comment

The sm_stress test seems to be missing from the latest release. When trying to run it explicitly, DCGM complains:

$ dcgmi diag -i 0,1,2,3 -v -r sm_stress --fail-early -p "sm_stress.target_stress=17000"
Invalid Parameter String: test 'sm_stress' does not match any loaded tests. Check logs for plugin failures.

The corresponding shared objects are no longer part of the RPM:

/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so
/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so.3
/usr/share/nvidia-validation-suite/plugins/cuda12/libSmStress.so.3.1.8

The release notes for version 3.1.3 mention that sm_stress is no longer run as part of diagnostic levels 3 or 4, but nothing in the release notes says that the test was removed in 3.2.3.

The sources for version 3.2.3 have not been exported to GitHub yet.

(As a side note, the documentation for DCGM Diagnostics contradicts the release notes, since it still lists sm_stress as part of diagnostic levels 3 and 4.)

bsteinb avatar Aug 17 '23 12:08 bsteinb

The "sm_stress" test was deprecated in 3.1.3 because its functionality is superseded by the "diagnostic" test. It was removed in 3.2.3. The "diagnostic" test is the recommended replacement.

glowkey avatar Aug 23 '23 15:08 glowkey
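
For scripts that must run against several DCGM versions, the reply above suggests switching the test name at 3.2.3. The following is only a sketch: `pick_stress_test` and `build_diag_command` are hypothetical helper names, the 3.2.3 cutoff is taken from the maintainer's comment, and version comparison relies on GNU `sort -V`. The `sm_stress.target_stress` parameter from the original command is dropped, since the thread does not say whether the "diagnostic" test has an equivalent setting.

```shell
#!/bin/sh
# Hypothetical helper: choose a diag test name for a given DCGM version.
# Assumption (from the thread): sm_stress is unavailable from 3.2.3 onward,
# and "diagnostic" is the recommended replacement.
pick_stress_test() {
    version="$1"
    # sort -V (GNU coreutils) orders version strings; if 3.2.3 comes out
    # first, the given version is 3.2.3 or newer.
    oldest=$(printf '%s\n%s\n' "3.2.3" "$version" | sort -V | head -n 1)
    if [ "$oldest" = "3.2.3" ]; then
        echo "diagnostic"
    else
        echo "sm_stress"
    fi
}

# Print the command instead of running it, so the sketch works on a
# machine without DCGM installed. $1 = DCGM version, $2 = GPU id list.
build_diag_command() {
    echo "dcgmi diag -i $2 -v -r $(pick_stress_test "$1") --fail-early"
}
```

Calling `build_diag_command 3.2.3 0,1,2,3` prints a `dcgmi diag` invocation that names `diagnostic` rather than `sm_stress`; the installed version has to be supplied by the caller.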