k8s-device-plugin
0/2 nodes are available: 2 Insufficient nvidia.com/gpu
I am facing this old issue. I have gone through all the relevant workarounds, but the issue still persists.
Kubernetes version: 1.14
Docker version on GPU node: 19.03.6
GPU node: 4 x GTX 1080 Ti
I am trying to deploy this example:
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-gpu
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      volumes:
      - hostPath:
          path: /usr/lib/nvidia-418/bin
        name: bin
      - hostPath:
          path: /usr/lib/nvidia-418
        name: lib
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        name: libcuda-so-1
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so
        name: libcuda-so
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/local/nvidia/lib
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-service
  labels:
    app: tensorflow-gpu
spec:
  selector:
    app: tensorflow-gpu
  ports:
  - port: 8888
    protocol: TCP
    nodePort: 30061
  type: LoadBalancer
---
And I am getting the following error: 0/2 nodes are available: 2 Insufficient nvidia.com/gpu
When I specify the GPU node explicitly in the deployment YAML, I get the following error instead: Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.
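A quick way to see whether the device plugin ever registered the GPUs with the kubelet is to check what the node advertises (the node name below is a placeholder):

    # Does the GPU node advertise the extended resource at all?
    kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'

If nvidia.com/gpu does not appear under Capacity and Allocatable (or shows 0), the scheduler will keep reporting Insufficient nvidia.com/gpu regardless of what the pod spec requests.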
/etc/docker/daemon.json on GPU node:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
I have restarted docker and kubelet.
I am using this nvidia daemon: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
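Two quick checks that are usually worth doing at this point (the commented output lines show what one would expect to see, not output captured from this node):

    # Confirm the default runtime really switched after the docker restart:
    docker info | grep -i runtime
    #   Runtimes: nvidia runc
    #   Default Runtime: nvidia

    # Confirm the device plugin daemonset actually has a pod running on the GPU node:
    kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin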
Should I:
- label the GPU node that has the NVIDIA GPUs somehow? (a labeling sketch follows below)
- restart the master node?
Any help here is more than welcome!
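On the labeling question: a label is not required for the device plugin to advertise GPUs, but if you want to pin the pod to the GPU node, a minimal sketch looks like this (the label key/value and node name are made up for illustration):

    kubectl label node <gpu-node-name> accelerator=nvidia-gtx-1080ti

and then, in the Deployment's pod template (at the same indentation level as containers:):

      nodeSelector:
        accelerator: nvidia-gtx-1080ti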
I am facing the same issue. Going through the container logs, it is throwing the error below, which I assume means something is wrong with the image itself:
libdc1394 error: Failed to initialize libdc1394
I think you need to use the NVIDIA one as the base image of your Dockerfile: FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04 (I guess you have the NVIDIA device plugin daemonset installed on the cluster).
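For example, a minimal Dockerfile along those lines might look like the sketch below (the framework install and version pins are illustrative and may need adjusting for your application):

    # Sketch only. The important part is the base image: it ships the CUDA 10.0 /
    # cuDNN 7 user-space libraries, while the driver itself is injected from the
    # host by the nvidia container runtime at run time.
    FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

    # Install Python and pip on top (illustrative; pick what your app needs)
    RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
     && rm -rf /var/lib/apt/lists/*

    # Example only: a tensorflow-gpu release built against CUDA 10.0
    RUN pip3 install --upgrade pip && pip3 install "tensorflow-gpu==1.14.*"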
You mean in the pod spec file? Even after I use the above image, I am seeing the error:
libdc1394 error: Failed to initialize libdc1394
So I skipped that example pod and tried this deployment with a smaller number of replicas, and it worked fine:
https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml
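For reference, a working GPU deployment of that general shape looks roughly like the sketch below (names and image are illustrative, not necessarily the exact contents of the linked file). Note that no hostPath volumes for the driver libraries are needed: with default-runtime set to nvidia, the runtime injects them into the container.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-test
      template:
        metadata:
          labels:
            app: gpu-test
        spec:
          containers:
          - name: cuda
            image: nvidia/cuda:10.0-base
            command: ["sh", "-c", "nvidia-smi && sleep 3600"]
            resources:
              limits:
                nvidia.com/gpu: 1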
Hello!
Sorry for the lag. Can you fill in the default issue template? It is usually super helpful and makes it easier to help :)
- [ ] The output of nvidia-smi -a on your host
- [ ] Your docker configuration file (e.g.: /etc/docker/daemon.json)
- [ ] The k8s-device-plugin container logs
- [ ] The node description (kubectl describe nodes)
- [ ] The kubelet logs on the node (e.g.: sudo journalctl -r -u kubelet)
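Most of those items already name a command; for the k8s-device-plugin container logs, something along these lines should work (the label selector is an assumption based on the daemonset manifest, adjust if your pods are labeled differently):

    kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
    kubectl -n kube-system logs -l name=nvidia-device-plugin-ds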
@RenaudWasTaken I think the issue is that the Docker default runtime cannot be set to "nvidia" on Docker 19.03; runtime: nvidia has been deprecated there, so we need a fix for that.
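A pair of sanity checks often suggested for Docker 19.03 setups (the image tag is illustrative; both assume the NVIDIA container runtime packages are installed):

    # Uses the 19.03 --gpus flag, which works even without default-runtime=nvidia:
    docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi

    # No --gpus flag: this only sees the GPUs if the default runtime is nvidia,
    # which is what Kubernetes-launched containers rely on with dockershim:
    docker run --rm nvidia/cuda:10.0-base nvidia-smi

If the first command works but the second does not, the default-runtime setting in /etc/docker/daemon.json is not being picked up.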
Removed my previous comment with a link to this one so that there is one canonical place with a response to this issue:
https://github.com/NVIDIA/k8s-device-plugin/issues/168#issuecomment-625981223
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.