
"Nvidia GPU Device Plugin" not working

Open alexgornov opened this issue 3 years ago • 7 comments

Nomad version

Nomad v1.3.6+

Operating system and Environment details

CentOS Stream 8
Plugin "nomad-device-nvidia" v1.0.0 (https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip)
NVIDIA-SMI 515.57, Driver Version: 515.57, CUDA Version: 11.7

Issue

"Nvidia GPU Device Plugin" not working on Nomad v1.3.6+

Reproduction steps

Install the nomad-device-nvidia plugin on Nomad v1.3.6+. Config file:

plugin_dir = "/opt/nomad/plugins"
...
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}

Expected Result

There is a GPU in the output of the "nomad node status" command on 1.3.5.

Log 1.3.5: nomad1.3.5.log

Actual Result

There is no GPU in the output of the "nomad node status" command on 1.3.6.

Log 1.3.6: nomad1.3.6.log

Thank you!

alexgornov avatar Oct 13 '22 11:10 alexgornov

I see the same behavior on 1.4.1. You can also quickly test it with the following job (which will not be planned if the GPU is not detected properly):

job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image = "nvidia/cuda:11.0.3-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

shoeffner avatar Oct 13 '22 14:10 shoeffner

I see the same behavior on 1.4.1. You can also quickly test it with the following job (which will not be planned if the GPU is not detected properly):

job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image = "nnvidia/cuda:11.0.3-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

The job is not starting.

alexgornov avatar Oct 13 '22 14:10 alexgornov

It seems I smuggled an additional n into the image name (will fix it now), but on 1.3.2 the job is starting, on 1.4.1 it is not placed, thus the same behavior you observed.

edit: I just realized, maybe we had a misunderstanding. My idea was to add a minimal working example for the maintainers to reproduce the problem.

shoeffner avatar Oct 13 '22 14:10 shoeffner

I have the exact same issue with 1.4.1, but 1.3.1 works fine.

Fr0stoff avatar Oct 26 '22 07:10 Fr0stoff

I have the same problem with Nomad 1.4.1. Did you manage to run a GPU job somehow? Thanks.

Fr0stoff avatar Oct 26 '22 08:10 Fr0stoff

No, we downgraded the GPU clients back to 1.3.5 and only run the servers on 1.4.1.


shoeffner avatar Oct 26 '22 13:10 shoeffner

Struggling with the same issue on 1.4.2 fwiw.

heipei avatar Oct 28 '22 11:10 heipei

Seeing this on 1.4.2 as well.

jessfraz avatar Nov 01 '22 18:11 jessfraz

I have the same problem when upgrading from 1.3.5 to 1.4.1.

Mileshin avatar Nov 02 '22 11:11 Mileshin

Hi folks, we've seen this issue and it's on our pile to triage. If you use the reaction 👍 on the top-level post that's more helpful, unless you're seeing the problem on a different driver version than the original post.

Does anyone have debug-level logs from a client during startup for this issue? It'd help kick off our investigation to see if the problem is in fingerprinting the device or whether the problem is in communicating with the driver.
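For anyone collecting these: debug logging can be turned on in the client's agent configuration (or streamed from a running agent with "nomad monitor -log-level=DEBUG"). A minimal agent-config sketch; the log_file path is illustrative:

```hcl
# Raise the log level so plugin fingerprinting messages
# are captured during client startup.
log_level = "DEBUG"

# Optional: also write logs to a file (path is an example).
log_file = "/var/log/nomad/nomad.log"
```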

tgross avatar Nov 02 '22 12:11 tgross

I can generate debug-level logs tomorrow.


shoeffner avatar Nov 02 '22 17:11 shoeffner

I've compiled a debug log during startup, let me know if this helps: https://gist.github.com/heipei/6d71b12fa086486b907729763981f27c

heipei avatar Nov 02 '22 20:11 heipei

Thanks @heipei. I've extracted the relevant log lines here:

device plugin logs

2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins address=/tmp/plugin070418469 network=unix timestamp=2022-11-02T20:01:12.258Z
2022-11-02T20:01:12.260Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins network=unix address=/tmp/plugin547638610 timestamp=2022-11-02T20:01:13.055Z
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:13.056Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
...
2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0
...
2022-11-02T20:01:23.170Z [INFO] client.plugin: starting plugin manager: plugin-type=device
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: starting plugin: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
...
2022-11-02T20:01:23.170Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
...
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: plugin started: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia pid=964949
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: waiting for RPC address: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia
...
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr: using plugin: plugin=nvidia-gpu version=2
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr.nomad-device-nvidia: plugin address: plugin=nvidia-gpu address=/tmp/plugin379502824 network=unix timestamp=2022-11-02T20:01:23.879Z
2022-11-02T20:01:23.906Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2022-11-02T20:01:23.906Z [DEBUG] client: new devices detected: devices=1

It looks like the plugin is having trouble fingerprinting during the initial startup, but it's succeeding later (enough for the scheduler to detect that the client has done so, at least). I know it's been a minute since we did a Nvidia driver release, so I took a look at the repo and was reminded of https://github.com/hashicorp/nomad-device-nvidia/pull/6. There hasn't been a release of the changes made there. @shoenig you noted in the PR that we had some fixes to do with the implementation -- do you recall what the symptoms were there? (If not, I can try to stand up a box on AWS with an Nvidia card and dig in further.)

tgross avatar Nov 03 '22 12:11 tgross

IIRC what #6 uncovered was that the external plugin still imported the nvidia stuff from Nomad, the act of which was enough to trigger an init block and cause bad things to happen. We may just need to finally cut a release with all the changes on the plugin side; let me try.

shoenig avatar Nov 03 '22 14:11 shoenig

Hi @tgross, I encountered the same problem today. By researching the code, I found that Devices was not successfully updated in batchFirstFingerprints. Maybe I can help fix it, so I made a PR; can you review it when you have time?

vuuihc avatar Nov 03 '22 17:11 vuuihc

Ah nice find @vuuihc, indeed this looks like fallout from https://github.com/hashicorp/nomad/pull/14139.

shoenig avatar Nov 03 '22 18:11 shoenig

Thanks for investigating and for the PR @vuuihc! The fix should go out in the next releases of 1.4.x, 1.3.x, and 1.2.x.

shoenig avatar Nov 03 '22 21:11 shoenig

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Mar 04 '23 02:03 github-actions[bot]