"Nvidia GPU Device Plugin" not working
Nomad version
Nomad v1.3.6+
Operating system and Environment details
CentOS Stream 8
Plugin "nomad-device-nvidia" v1.0.0 (https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip)
NVIDIA-SMI 515.57, Driver Version: 515.57, CUDA Version: 11.7
Issue
"Nvidia GPU Device Plugin" not working on Nomad v1.3.6+
Reproduction steps
Install the nomad-device-nvidia plugin on Nomad v1.3.6+. Config file:
plugin_dir = "/opt/nomad/plugins"
...
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
Expected Result
There is a GPU in the output of "nomad node status".
Log 1.3.5: nomad1.3.5.log
Actual Result
There is no GPU in the output of "nomad node status".
Log 1.3.6: nomad1.3.6.log
Thank you!
I see the same behavior on 1.4.1. You can also quickly test it with the following job (which will not be planned if the GPU is not detected properly):
job "gpu-test" {
datacenters = ["dc1"]
type = "batch"
group "smi" {
task "smi" {
driver = "docker"
config {
image = "nvidia/cuda:11.0.3-base-ubuntu20.04"
command = "nvidia-smi"
}
resources {
device "nvidia/gpu" {
count = 1
}
}
}
}
}
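As a side note, once the GPU is fingerprinted you can also match on device attributes; per the Nomad jobspec docs, the device block accepts constraint and affinity sub-blocks (the memory threshold below is purely illustrative):

device "nvidia/gpu" {
  count = 1

  # Illustrative: only place on GPUs with at least 2 GiB of device memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }
}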
Job not starting.

It seems I smuggled an additional n into the image name (will fix it now), but on 1.3.2 the job starts, while on 1.4.1 it is not placed, so it's the same behavior you observed.
edit: I just realized, maybe we had a misunderstanding. My idea was to add a minimal working example for the maintainers to reproduce the problem.
I have the exact same issue with 1.4.1, but 1.3.1 works fine.
I have the same problem with Nomad 1.4.1. Did you manage to run a GPU job somehow? Thanks.
No, we downgraded the GPU clients back to 1.3.5 and only run the servers on 1.4.1.
Struggling with the same issue on 1.4.2 fwiw.
Seeing this on 1.4.2 as well.
I have the same problem when upgrading from 1.3.5 to 1.4.1.
Hi folks, we've seen this issue and it's on our pile to triage. If you use the reaction 👍 on the top-level post that's more helpful, unless you're seeing the problem on a different driver version than the original post.
Does anyone have debug-level logs from a client during startup for this issue? It'd help kick off our investigation to see if the problem is in fingerprinting the device or whether the problem is in communicating with the driver.
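For anyone gathering them: setting the agent's log level in the client configuration and restarting the client should capture the startup fingerprint. A minimal sketch of the relevant agent config options (the log_file path is just an example):

log_level = "DEBUG"

# Optional: persist logs to disk so they're easy to attach.
log_file = "/var/log/nomad/nomad.log"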
I can generate debug-level logs tomorrow.
I've compiled a debug log during startup, let me know if this helps: https://gist.github.com/heipei/6d71b12fa086486b907729763981f27c
Thanks @heipei. I've extracted the relevant log lines here:
device plugin logs
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins address=/tmp/plugin070418469 network=unix timestamp=2022-11-02T20:01:12.258Z
2022-11-02T20:01:12.260Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins network=unix address=/tmp/plugin547638610 timestamp=2022-11-02T20:01:13.055Z
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:13.056Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
...
2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0
...
2022-11-02T20:01:23.170Z [INFO] client.plugin: starting plugin manager: plugin-type=device
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: starting plugin: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
...
2022-11-02T20:01:23.170Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
...
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: plugin started: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia pid=964949
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: waiting for RPC address: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia
...
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr: using plugin: plugin=nvidia-gpu version=2
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr.nomad-device-nvidia: plugin address: plugin=nvidia-gpu address=/tmp/plugin379502824 network=unix timestamp=2022-11-02T20:01:23.879Z
2022-11-02T20:01:23.906Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2022-11-02T20:01:23.906Z [DEBUG] client: new devices detected: devices=1
It looks like the plugin is having trouble fingerprinting during the initial startup, but it's succeeding later (enough for the scheduler to detect that the client has done so, at least). I know it's been a minute since we did an Nvidia driver release, so I took a look at the repo and was reminded of https://github.com/hashicorp/nomad-device-nvidia/pull/6. There hasn't been a release of the changes made there. @shoenig you noted in the PR that we had some fixes to do with the implementation -- do you recall what the symptoms were there? (If not, I can try to stand up a box on AWS with an Nvidia card and dig in further.)
IIRC what #6 uncovered was that the external plugin still imported the nvidia stuff from nomad, the act of which was enough to trigger an init block, causing bad things to happen. We may just need to finally cut a release with all the changes on the plugin side; let me try.
Hi @tgross, I encountered the same problem today. By researching the code, I found that Devices was not successfully updated in batchFirstFingerprints. Maybe I can help fix it, so I made a PR. Can you review it when you have time?
Ah nice find @vuuihc, indeed this looks like fallout from https://github.com/hashicorp/nomad/pull/14139.
Thanks for investigating and for the PR @vuuihc! The fix should go out in the next releases of 1.4.x, 1.3.x, and 1.2.x.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.