cuda_memtest

Handle GPUs that lack full NVML Support

Open ax3l opened this issue 5 years ago • 5 comments

Nvidia's NVML does not support non-Tesla products very well. Problems are known with mobile cards and even Quadro cards. (Reported to Nvidia as an RFE, Bug ID 2417658.)

Anyway, this can lead to cuda_memtest throwing an "[NVML] Error: Not supported" exception (in nvmlDeviceGetSerial), which we should catch.
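A minimal sketch of what catching this could look like, not the actual cuda_memtest code. To keep it compilable without the NVML headers, `Status`, `SerialQuery`, and `serialOrPlaceholder` are hypothetical stand-ins: in real code the query would wrap `nvmlDeviceGetSerial`, and the two status values would be `NVML_SUCCESS` and `NVML_ERROR_NOT_SUPPORTED`. The placeholder string "unknown" is likewise an assumption.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for NVML status codes (assumption: maps to nvmlReturn_t values
// such as NVML_SUCCESS and NVML_ERROR_NOT_SUPPORTED in real code).
enum class Status { Success, NotSupported, Unknown };

// Hypothetical query type: any callable that fills `out` and reports a status,
// e.g. a thin lambda around nvmlDeviceGetSerial.
using SerialQuery = std::function<Status(std::string& out)>;

// Returns the serial number, or a placeholder when the device does not
// support the query. Any other failure is still reported as an error.
std::string serialOrPlaceholder(const SerialQuery& query)
{
    std::string serial;
    const Status st = query(serial);

    if (st == Status::Success)
        return serial;

    // The serial is optional information: degrade quietly on GPUs
    // (mobile, some Quadro) where NVML reports "Not supported".
    if (st == Status::NotSupported)
        return "unknown";

    throw std::runtime_error("[NVML] Error: unexpected failure in serial query");
}
```

The point of the design is that only the "not supported" status is downgraded; every other error still surfaces, so genuine NVML failures are not hidden.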

ax3l, Oct 11 '18 10:10

Testing on a GTX 950M, I get this while running PIConGPU:

</home/berceanu/src/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-7.3.0/picongpu-0.4.0-lqbxwsudtgms2do4ksm57uovvv4ypx4e/thirdParty/cuda_memtest/misc.cpp>:35

It seems to be just a warning, as the simulation completes after that.

I see that disabling the memtest fixes it:

pic-build -b "cuda:50" -c "-DCUDAMEMTEST_ENABLE=OFF"

Should we add a known issue in the docs for non-Tesla cards?

berceanu, Oct 26 '18 11:10

Thx for the report! Can you please post the warning? Is there a line missing?

ax3l, Oct 26 '18 12:10

Nope, there is only that single line.

berceanu, Oct 26 '18 13:10

Ah ok, but it does not abort, yes!

Ok, we have to clean up that macro. It should not start writing to cerr on its own: https://github.com/ComputationalRadiationPhysics/cuda_memtest/blob/7a585d504831431d0e95ff00d0217181201dbb12/cuda_memtest.h#L146-L150

ax3l, Nov 02 '18 13:11

I proposed a fix in #18 that should remove that noisy line from your output. The line itself can (rightfully) be ignored.

ax3l, Nov 02 '18 13:11