
0/2 nodes are available: 2 Insufficient nvidia.com/gpu

Open nikosep opened this issue 4 years ago • 7 comments

I am facing this old issue. I have gone through all the relevant workarounds, but the issue still persists.

Kubernetes version: 1.14
Docker version on GPU node: 19.03.6
GPU node: 4 x GTX1080Ti

I am trying to deploy this example:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-gpu
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      volumes:
      - hostPath:
          path: /usr/lib/nvidia-418/bin
        name: bin
      - hostPath:
          path: /usr/lib/nvidia-418
        name: lib
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        name: libcuda-so-1
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so
        name: libcuda-so
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/local/nvidia/lib
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-service
  labels:
    app: tensorflow-gpu
spec:
  selector:
    app: tensorflow-gpu
  ports:
  - port: 8888
    protocol: TCP
    nodePort: 30061
  type: LoadBalancer
---

And I am getting the following error: 0/2 nodes are available: 2 Insufficient nvidia.com/gpu

When specifying the GPU node explicitly in the deployment YAML, I get the following error instead: Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.
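
For context, one way to confirm whether the scheduler actually sees any GPUs is to look at what the nodes advertise (a sketch, not part of the original report; the node name is a placeholder):

# Each node's allocatable GPU count; an empty value means the device plugin
# has not registered nvidia.com/gpu on that node.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Or inspect a single node; nvidia.com/gpu should appear under both
# Capacity and Allocatable.
kubectl describe node <gpu-node-name> | grep -A 10 -E "Capacity|Allocatable"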

/etc/docker/daemon.json on GPU node:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

I have restarted Docker and the kubelet.
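
(A sanity check that may be worth running on the GPU node after the restart, sketched here with an example CUDA image tag that may have changed since:)

# The default runtime reported by Docker should be "nvidia".
docker info 2>/dev/null | grep -i "default runtime"

# A plain container should then be able to see the GPUs without extra flags.
docker run --rm nvidia/cuda:10.0-base nvidia-smi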

I am using this NVIDIA device plugin DaemonSet manifest: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

Should I:

  • label the GPU node that has the NVIDIA GPUs somehow (see the sketch below)?
  • restart the master node?
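
(For reference, the device plugin DaemonSet itself does not require a node label by default, but labeling and pinning a workload to the GPU node could look like this sketch; the node name and label value are placeholders:)

# Label the GPU node.
kubectl label node <gpu-node-name> accelerator=nvidia-gtx-1080ti

# Then pin GPU workloads to it in the pod spec:
#   nodeSelector:
#     accelerator: nvidia-gtx-1080ti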

Any help here is more than welcome!

nikosep avatar Mar 03 '20 10:03 nikosep

I am facing the same issue. Going through the container logs, it is throwing the error below, which I assume means something is wrong with the image itself:

libdc1394 error: Failed to initialize libdc1394

Sarang-Sangram avatar Mar 18 '20 08:03 Sarang-Sangram

I am facing the same issue. Going through the container logs, it is throwing the error below, which I assume means something is wrong with the image itself:

libdc1394 error: Failed to initialize libdc1394

I think you need to use the NVIDIA one as the base image in your Dockerfile: FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04 (I assume you have the NVIDIA device plugin DaemonSet installed on the cluster).

nikosep avatar Mar 18 '20 09:03 nikosep

You mean in the pod spec file? Even after using the above image, I am seeing the error:

libdc1394 error: Failed to initialize libdc1394

Sarang-Sangram avatar Mar 18 '20 09:03 Sarang-Sangram

So I skipped that example pod and tried this deployment with a smaller number of replicas, and it worked fine:

https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml
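
(For anyone comparing, a minimal smoke-test pod is usually enough to tell whether scheduling against nvidia.com/gpu works at all; the pod name and image tag below are only examples:)

# One-off pod that requests a single GPU and prints nvidia-smi output.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# If it schedules, the logs should list the GPUs.
kubectl logs gpu-smoke-test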

Sarang-Sangram avatar Mar 18 '20 09:03 Sarang-Sangram

Hello!

Sorry for the lag. Could you fill in the default issue template? This is usually super helpful and makes it easier to help :) (a rough command sketch for gathering these follows the list)

  • [ ] The output of nvidia-smi -a on your host
  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [ ] The k8s-device-plugin container logs
  • [ ] The node description (kubectl describe nodes)
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
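
A rough command sketch for gathering the items above (assuming shell access to the GPU node; the device plugin pod name is a placeholder):

# On the GPU node:
nvidia-smi -a > nvidia-smi.txt
cat /etc/docker/daemon.json
sudo journalctl -r -u kubelet > kubelet.log

# From a machine with kubectl access:
kubectl describe nodes > nodes.txt
kubectl -n kube-system logs <nvidia-device-plugin-pod-name> > device-plugin.log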

RenaudWasTaken avatar Mar 21 '20 06:03 RenaudWasTaken

@RenaudWasTaken I think the issue is that the Docker default runtime cannot be set to "nvidia" on Docker 19.03; runtime: nvidia has been deprecated, and we need a fix for that.

regulusv avatar May 08 '20 01:05 regulusv

Removed my previous comment with a link to this one so that there is one canonical place with a response to this issue:

https://github.com/NVIDIA/k8s-device-plugin/issues/168#issuecomment-625981223

klueska avatar May 08 '20 21:05 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]