nhc
Replacement for NVIDIA_HEALTHMON check?
We have some GPU nodes, and I would like to enable the NVIDIA_HEALTHMON check in nhc.conf. Unfortunately, it seems that NVIDIA no longer offers nvidia-healthmon (at least I was unable to find it after a lot of searching).
Question: How can NHC check GPU health in the absence of nvidia-healthmon? One simple-minded check is the existence of the /dev/nvidia* files, as in this example nhc.conf line:
gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
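A slightly stronger variant of the device-file check, sketched here as a standalone bash function, compares the number of GPUs the driver enumerates with the number the node is supposed to have. It assumes `nvidia-smi -L` prints one `GPU <n>: ...` line per device. The function name `check_gpu_count` and the `NVSMI_CMD` override are hypothetical and not part of NHC; inside an NHC check script the failure branches would normally report through NHC's own error handling rather than `echo`.

```shell
#!/bin/bash
# Hypothetical helper, not part of NHC: succeed only if the driver lists
# the expected number of GPUs. NVSMI_CMD may be overridden for testing.
check_gpu_count() {
    local expected="$1"
    local cmd="${NVSMI_CMD:-nvidia-smi}"
    local listing count

    # Assumption: "nvidia-smi -L" prints one "GPU <n>: ..." line per device.
    if ! listing="$($cmd -L 2>/dev/null)"; then
        echo "check_gpu_count: '$cmd -L' failed" >&2
        return 1
    fi

    # grep -c prints 0 and exits nonzero when nothing matches; "|| true"
    # keeps that case from aborting a caller that runs under "set -e".
    count="$(printf '%s\n' "$listing" | grep -c '^GPU [0-9]' || true)"

    if [ "$count" -ne "$expected" ]; then
        echo "check_gpu_count: expected $expected GPUs, found $count" >&2
        return 1
    fi
    return 0
}
```

Calling `check_gpu_count 4` would correspond to the four device files listed in the nhc.conf line above. Like the file test, this only shows that the driver can see the devices, not that they are healthy.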
This is related to my #29. Short version: investigate the NVVS portion (at least) of NVIDIA DCGM:
- http://www.nvidia.com/object/data-center-gpu-manager.html
Yes, as @jrcoombs mentioned, it's been replaced with a new tool, and unfortunately I've been swamped with the Trinity merge (and other LANL work) and have not kept up with NHC as I should've. I also need to work out with the Feynman Center for Innovation (LANL's version of Tech Transfer) all the logistics of a LANL employee accepting an nVidia contribution into an LBNL project! :-) So please bear with me a tad longer -- I hope to get this all ironed out in the next month or so and plan to roll a new release for SC17 with all these issues squared away.
Let me know how I can help.
John
Sent by John from his mobile
On Jul 26, 2017, at 9:12 PM, Michael Jennings [email protected] wrote:
Yes, as @jrcoombs mentioned, it's been replaced with a new tool, and unfortunately I've been swamped with the Trinity merge (and other LANL work) and have not kept up with NHC as I should've. I also need to work out with the Feynman Center for Innovation (LANL's version of Tech Transfer) all the logistics of a LANL employee accepting an nVidia contribution into an LBNL project! :-) So please bear with me a tad longer -- I hope to get this all ironed out in the next month or so and plan to roll a new release for SC17 with all these issues squared away.
Michael,
I want to be more specific in my offer of help.
In my day job I am an Alliance Manager in the Tesla BU at NVIDIA, with responsibility for (among other things) NVIDIA’s relationships with its cluster management (Bright) and job scheduling partners (Adaptive, Altair, IBM Spectrum, SchedMD/SLURM, Univa, …).
In that role, I can help get paperwork in front of the right folks in legal, and help get it signed expeditiously. I can provide you with access to specific NVIDIA GPUs (or, if desirable, get you a GPU for development/qualification/validation). I can set up meetings (one-off or regularly scheduled) with members of the engineering team (and/or product management) for DCGM (which includes NVVS). They can answer questions you might have, and help expedite bug fixes.
My work email is above ([email protected]). I know you are busy. Tag me as appropriate.
John
How can I get the NVVS tool? I found the docs for NVVS, but I can't find any way to install it.
NVVS (soon to be called DCGM GPU Diagnostic) is part of the DCGM package and can be obtained here: https://developer.nvidia.com/data-center-gpu-manager-dcgm
Note that the link to the NVVS user guide is also on that page.
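Once DCGM is installed, one way to wire its diagnostic into a node check is to run it under a time limit and treat any nonzero exit as a failure. The following is a minimal sketch, not a tested NHC check. It assumes the `dcgmi diag -r 1` quick diagnostic is available on the node and that it exits nonzero when a test fails, which is worth verifying against your DCGM version. The names `run_gpu_diag` and `GPU_DIAG_CMD` are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper, not part of NHC or DCGM: run a GPU diagnostic
# command under a time limit and return its exit status.
run_gpu_diag() {
    local limit="${1:-60}"    # time limit in seconds
    # Assumption: "dcgmi diag -r 1" runs DCGM's quick diagnostic.
    local cmd="${GPU_DIAG_CMD:-dcgmi diag -r 1}"
    local rc=0

    # $cmd is intentionally unquoted so the string splits into words.
    # coreutils "timeout" exits with 124 when the time limit is hit.
    timeout "$limit" $cmd >/dev/null 2>&1 || rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "run_gpu_diag: diagnostic timed out after ${limit}s" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "run_gpu_diag: diagnostic exited with status $rc" >&2
    fi
    return "$rc"
}
```

NHC also provides a `check_cmd_status` check for running a command and comparing its exit status, so the same idea may be expressible directly in nhc.conf without a wrapper.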