NVVS (part of NVIDIA DCGM) has replaced nv-healthmon. NHC will fail on new GPUs w/o code mods

Open jrcoombs opened this issue 7 years ago • 5 comments

NVVS (part of NVIDIA's DCGM: Data Center GPU Manager) is the replacement for nv-healthmon, which is deprecated and unsupported for new and future NVIDIA hardware. On clusters using NHC, health checks for NVIDIA GPUs of the Pascal microarchitecture and later (P100/P4/P40 onward) will fail unless NHC is modified accordingly.

DCGM link: http://www.nvidia.com/object/data-center-gpu-manager.html

I can put you in direct contact with the DCGM engineering team at NVIDIA and get you the appropriate GPUs for your development and testing. When you are interested, just send me an email.

John Coombs, Tesla BU Alliance Management, NVIDIA, [email protected]

jrcoombs Apr 03 '17 15:04

How would I acquire the Release Candidate referenced in NVIDIA document DU-07862-001_v1.3, page 25? We are testing a GPU cluster and are looking for more verbose output from 'dcgmi diag -r 3', as it only returns "PCIe Fail - All", which is too vague to be helpful.
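
For anyone wiring this diagnostic into a node health check in the meantime, here is a rough sketch of a wrapper that runs the command and keeps its full output for logging. The function name `run_diag` and the timeout value are made up for illustration, and treating a non-zero exit status as a failure is an assumption that has not been checked against DCGM's documentation.

```python
import shutil
import subprocess


def run_diag(cmd=("dcgmi", "diag", "-r", "3"), timeout=1800):
    """Run a diagnostic command and return (ok, combined_output).

    Assumption: a non-zero exit status means the diagnostic did not pass.
    """
    # Report a missing tool explicitly instead of raising FileNotFoundError.
    if shutil.which(cmd[0]) is None:
        return False, f"{cmd[0]} not found on PATH"
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, f"{' '.join(cmd)} timed out after {timeout}s"
    # Keep stdout and stderr together so nothing the tool printed is lost.
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Calling `ok, output = run_diag()` and writing `output` to the node's log at least preserves everything the tool prints, even when the summary line itself is terse.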

jmcculloch4 Apr 24 '18 19:04

All DCGM packages and docs can be obtained here: https://developer.nvidia.com/data-center-gpu-manager-dcgm

bstollenvidia Apr 26 '18 17:04

FYI, nvvs doesn't seem to work with MIG-enabled GPUs:

/usr/share/nvidia-validation-suite/nvvs

DCGM GPU Diagnostic (version 418)

GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
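
A health check that wraps nvvs could treat this case separately from a real hardware failure. The sketch below only matches the message text quoted above; the function name and the returned labels are hypothetical, and the wording of the message may differ between DCGM versions.

```python
# Substring taken from the nvvs message quoted above (DCGM GPU Diagnostic
# version 418); other versions may word it differently.
MIG_MARKER = "MIG configuration is incompatible with the diagnostic"


def classify_nvvs_output(output: str) -> str:
    """Return 'skipped-mig' if nvvs refused to run because of MIG.

    Anything else is left as 'unclassified', since no other output
    format is assumed here.
    """
    if MIG_MARKER in output:
        return "skipped-mig"
    return "unclassified"
```

A wrapper could then log MIG nodes as skipped rather than marking them unhealthy.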

mick-t May 24 '23 20:05

If you need help testing any new tools to check on NVIDIA cards, I can help.

mick-t May 24 '23 20:05

I am no longer with NVIDIA. (I retired in 2020.) Duncan (copied) can tell you who to be in touch with there.

John

jrcoombs May 24 '23 23:05