nhc
NVVS (part of NVIDIA DCGM) has replaced nv-healthmon. NHC will fail on new GPUs without code modifications
NVVS (part of NVIDIA's DCGM: Data Center GPU Manager) is the replacement for nv-healthmon, which is deprecated and unsupported on new and future NVIDIA hardware. On clusters that use NHC, health checks for NVIDIA GPUs of the Pascal microarchitecture and later (P100/P4/P40 onward) will fail unless NHC is modified accordingly.
DCGM link: http://www.nvidia.com/object/data-center-gpu-manager.html
I can put you in direct contact with the DCGM engineering team at NVIDIA and get you the appropriate GPUs for development and testing. If you are interested, just send me an email.
John Coombs Tesla BU Alliance Management NVIDIA [email protected]
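For readers wondering what such a modification to NHC might look like, here is a minimal sketch of a custom check that wraps the DCGM diagnostic. It is not the actual NHC implementation: the names check_dcgm_diag_sketch and DCGM_DIAG_CMD are hypothetical, and the sketch assumes that a non-zero exit status from the diagnostic signals a failure. It calls NHC's die function when that function is available and otherwise only prints a message.

```shell
#!/usr/bin/env bash
# Sketch of an NHC-style check built on the DCGM diagnostic.
# check_dcgm_diag_sketch and DCGM_DIAG_CMD are hypothetical names.

# Diagnostic command; overridable so the function can be tried without DCGM.
DCGM_DIAG_CMD=${DCGM_DIAG_CMD:-dcgmi diag -r 3}

check_dcgm_diag_sketch() {
    local output rc=0
    # Capture stdout and stderr together so a failure report has context.
    # The command string is left unquoted on purpose so it splits into words.
    output=$($DCGM_DIAG_CMD 2>&1) || rc=$?
    if [ "$rc" -ne 0 ]; then
        local msg="check_dcgm_diag_sketch: '$DCGM_DIAG_CMD' exited with status $rc"
        # Assumption: a non-zero exit status means the diagnostic failed.
        printf '%s\n' "$output" >&2
        if command -v die >/dev/null 2>&1; then
            # Inside NHC, die reports the failure for this node.
            die 1 "$msg"
        else
            echo "$msg" >&2
        fi
        return 1
    fi
    return 0
}
```

Setting DCGM_DIAG_CMD to a harmless command such as `true` or `false` allows the control flow to be exercised on a machine without DCGM installed.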
How would I acquire the Release Candidate referenced in NVIDIA document DU-07862-001_v1.3, page 25? We are testing a GPU cluster and are looking for more verbose output from 'dcgmi diag -r 3', as it only returns "PCIe Fail - All", which is too vague to be helpful.
All DCGM packages and docs can be obtained here: https://developer.nvidia.com/data-center-gpu-manager-dcgm
FYI, nvvs doesn't seem to work with MIG-enabled GPUs:
/usr/share/nvidia-validation-suite/nvvs
DCGM GPU Diagnostic (version 418)
GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
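A health-check wrapper may want to tell this MIG case apart from a real hardware failure. The sketch below matches the message text shown above against captured diagnostic output. The helper name is_mig_incompatible is hypothetical, and the sketch assumes the wording of the message is the same in other DCGM versions, which may not hold.

```shell
#!/usr/bin/env bash
# is_mig_incompatible is a hypothetical helper name.
# Returns 0 if the captured diagnostic output ($1) contains the MIG
# incompatibility message quoted in this thread, 1 otherwise.
is_mig_incompatible() {
    case "$1" in
        *"MIG configuration is incompatible with the diagnostic"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

A caller could pass the captured output of the diagnostic as the first argument and, for example, skip the node instead of marking it as failed when the helper returns 0.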
If you need help testing any new tools to check on nvidia cards I can help.
I am no longer with NVIDIA. (I retired in 2020.) Duncan (copied) can tell you who to be in touch with there.
John
On May 24, 2023, at 16:30, Mick T. wrote: If you need help testing any new tools to check on nvidia cards I can help.