Replacement for NVIDIA_HEALTHMON check?

Open OleHolmNielsen opened this issue 7 years ago • 6 comments

We have some GPU nodes, and I would like to enable the NVIDIA_HEALTHMON check in nhc.conf. Unfortunately, it seems that NVIDIA no longer offers nvidia-healthmon (at least I was unable to find it after a lot of searching).

Question: How can NHC check GPU health in the absence of nvidia-healthmon? One simple-minded check is to test for the existence of the /dev/nvidia* files, as in this example nhc.conf line:

gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
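The same idea can be sketched as a standalone bash function that counts the per-GPU device nodes instead of listing each one. This is only an illustration: the function name `check_nvidia_devices` and its arguments are hypothetical and not part of NHC, and the sketch assumes the per-GPU nodes are named `nvidia0`, `nvidia1`, and so on. Like the line above, it only shows that the device files exist and are readable, not that the GPUs are healthy.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC): count readable NVIDIA character
# device nodes in a directory and compare against an expected number.
# Usage: check_nvidia_devices EXPECTED [DEV_DIR]
check_nvidia_devices() {
    local expected="$1"
    local dev_dir="${2:-/dev}"
    local count=0
    local dev

    # nvidia[0-9]* matches per-GPU nodes (nvidia0, nvidia1, ...) but not
    # nvidiactl or nvidia-uvm. An unmatched glob stays literal and simply
    # fails the -c test below.
    for dev in "$dev_dir"/nvidia[0-9]*; do
        if [ -c "$dev" ] && [ -r "$dev" ]; then
            count=$((count + 1))
        fi
    done

    if [ "$count" -ne "$expected" ]; then
        echo "check_nvidia_devices: found $count GPU device(s), expected $expected" >&2
        return 1
    fi
    return 0
}
```

For a node with four GPUs, `check_nvidia_devices 4` returns 0 only when exactly four readable device nodes are present under /dev.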

OleHolmNielsen avatar Jul 11 '17 14:07 OleHolmNielsen

This is related to my #29. Short version: investigate the NVVS portion (at least) of NVIDIA DCGM:

  • http://www.nvidia.com/object/data-center-gpu-manager.html

jrcoombs avatar Jul 26 '17 16:07 jrcoombs

Yes, as @jrcoombs mentioned, it's been replaced with a new tool, and unfortunately I've been swamped with the Trinity merge (and other LANL work) and have not kept up with NHC as I should've. I also need to work out with the Feynman Center for Innovation (LANL's version of Tech Transfer) all the logistics of a LANL employee accepting an nVidia contribution into an LBNL project! :-) So please bear with me a tad longer -- I hope to get this all ironed out in the next month or so and plan to roll a new release for SC17 with all these issues squared away.

mej avatar Jul 27 '17 01:07 mej

Let me know how I can help.

John

jrcoombs avatar Jul 27 '17 01:07 jrcoombs

Michael,

I want to be more specific in my offer of help.

In my day job I am an Alliance Manager in the Tesla BU at NVIDIA, with responsibility for (among other things) NVIDIA’s relationships with its cluster management (Bright) and job scheduling partners (Adaptive, Altair, IBM Spectrum, SchedMD/SLURM, Univa, …).

In that role, I can help get paperwork in front of the right folks in legal, and help get it signed expeditiously. I can provide you with access to specific NVIDIA GPUs (or if desirable, get you a GPU for development/qualification/validation). I can set up meetings (one-off or regularly scheduled) with members of the engineering team (and/or product management) for DCGM (which includes NVVS). They can answer questions you might have, and help expedite bug fixes.

My work email is above ([email protected]). I know you are busy. Tag me as appropriate.

John

jrcoombs avatar Jul 28 '17 16:07 jrcoombs

How can I get the NVVS tool? I found the docs for NVVS, but could not find a way to install it.

catsdogone avatar Nov 29 '17 03:11 catsdogone

NVVS (soon to be called DCGM GPU Diagnostic) is part of the DCGM package and can be obtained here: https://developer.nvidia.com/data-center-gpu-manager-dcgm

Note that the link to the NVVS user guide is also on that page.
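Once DCGM is installed, a quick diagnostic run could be wrapped for use in a health check roughly as follows. This is a minimal sketch, not an official NHC or DCGM integration: the function name `run_dcgm_quick_diag` is hypothetical, and the sketch assumes that the `dcgmi` command-line tool is on the PATH, that it can reach the DCGM host engine, and that a nonzero exit status from `dcgmi diag` signals a failed or incomplete diagnostic.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of NHC or DCGM): run the shortest DCGM
# diagnostic level and reduce the outcome to a simple return code.
#   0 = diagnostic exited successfully
#   1 = dcgmi diag exited nonzero (assumed here to mean a failure)
#   2 = dcgmi is not installed or not on the PATH
run_dcgm_quick_diag() {
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "run_dcgm_quick_diag: dcgmi not found in PATH" >&2
        return 2
    fi

    # "-r 1" requests the quickest diagnostic run level.
    if ! dcgmi diag -r 1 >/dev/null 2>&1; then
        echo "run_dcgm_quick_diag: dcgmi diag reported a failure" >&2
        return 1
    fi
    return 0
}
```

The wrapper discards the diagnostic output; when a node fails, running `dcgmi diag -r 1` by hand shows which individual test did not pass.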

bstollenvidia avatar Nov 29 '17 17:11 bstollenvidia