k8s-device-plugin
Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
kubectl describe po gpu-pod
Name: gpu-pod
Namespace: default
Priority: 0
Service Account: default
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-container:
Image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9cp5g (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-9cp5g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 27s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
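The FailedScheduling message means no node in the cluster is currently advertising the nvidia.com/gpu extended resource, so there is nothing to preempt either. A quick way to confirm that directly (a sketch):
# list each node with the number of GPUs it advertises; <none> means the
# device plugin never registered the resource with the kubelet
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"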
2. Steps to reproduce the issue
The VM is Ubuntu 20.04.
- Install the 470.141.03 driver. Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-20C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 1589MiB / 20475MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
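As a sanity check of the driver step, the kernel module and the vGPU can also be confirmed inside the guest (a minimal sketch using standard driver tooling):
lsmod | grep nvidia    # the nvidia kernel modules should be loaded
nvidia-smi -L          # lists the visible GPU(s), here the GRID A100D-20C vGPU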
- Deploy a single-node Kubernetes cluster using Kubespray. Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-11T02:46:24Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
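For reference, a single-node Kubespray deployment is normally driven by the cluster.yml playbook, roughly like this (a sketch assuming the standard sample inventory layout):
# run from the kubespray checkout, against an inventory listing only this VM
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml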
- Install the NVIDIA Container Toolkit:
nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
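The toolkit install itself follows the standard apt flow, roughly as below (a sketch; the repository setup line may differ by distribution and toolkit version):
# add the libnvidia-container apt repository and install the toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit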
- Edit the containerd config (/etc/containerd/config.toml):
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0
[grpc]
max_recv_message_size = 16777216
max_send_message_size = 16777216
[debug]
level = "info"
[metrics]
address = ""
grpc_histogram = false
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.7"
max_container_log_line_size = -1
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
runtime_engine = ""
runtime_root = ""
base_runtime_spec = "/etc/containerd/cri-base.json"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
systemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
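After editing the config, containerd has to be restarted; the nvidia runtime can also be exercised directly to rule out a runtime-level problem (a sketch; the CUDA image tag is just an example):
sudo systemctl restart containerd
# run nvidia-smi through the nvidia runtime binary, bypassing Kubernetes entirely
sudo ctr image pull docker.io/nvidia/cuda:11.4.3-base-ubuntu20.04
sudo ctr run --rm --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.4.3-base-ubuntu20.04 gpu-test nvidia-smi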
- Enable GPU support in Kubernetes by deploying the device plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
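A quick way to confirm the plugin DaemonSet actually came up (a sketch; the pod label is assumed from the v0.12.3 static manifest):
kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds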
- Run a GPU job:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
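Once applied, the pod stays Pending until some node advertises nvidia.com/gpu; when it does schedule, the vectorAdd sample logs a short success message. A sketch of the follow-up checks:
kubectl get pod gpu-pod -w          # wait for the pod to leave Pending
kubectl logs gpu-pod                # the vectorAdd sample prints its result here
kubectl get events --field-selector involvedObject.name=gpu-pod   # scheduling events only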
3. Information to attach (optional if deemed irrelevant)
- containerd version
containerd --version
containerd github.com/containerd/containerd v1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
- KVM config
<domain type='kvm'>
<name>test-vm1</name>
<uuid>695b8bef-a78a-443a-950c-66a055df670a</uuid>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/doma>
<libosinfo:os id="http://ubuntu.com/ubuntu/20.04"/>
</libosinfo:libosinfo>
</metadata>
<memory unit='KiB'>4194304</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
<vcpu placement='static'>4</vcpu>
<os>
<type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
</features>
<cpu mode='host-model' check='partial'/>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled='no'/>
<suspend-to-disk enabled='no'/>
</pm>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/var/lib/libvirt/images/test-disk1.qcow2'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</disk>
<controller type='usb' index='0' model='ich9-ehci1'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x7'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci1'>
<master startport='0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0' m>
</controller>
<controller type='usb' index='0' model='ich9-uhci2'>
<master startport='2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x1'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci3'>
<master startport='4'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x2'/>
</controller>
<controller type='sata' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
</controller>
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='1' port='0x8'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' m>
</controller>
<controller type='pci' index='2' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='2' port='0x9'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
</controller>
<controller type='pci' index='3' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='3' port='0xa'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
</controller>
<controller type='pci' index='4' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='4' port='0xb'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
</controller>
<controller type='pci' index='5' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='5' port='0xc'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
</controller>
<controller type='pci' index='6' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='6' port='0xd'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
</controller>
<controller type='virtio-serial' index='0'>
<address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</controller>
<interface type='bridge'>
<mac address='52:54:00:6e:b1:69'/>
<source bridge='virbr0'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
<serial type='pty'>
<target type='isa-serial' port='0'>
<model name='isa-serial'/>
</target>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0'/>
<address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='>
<source>
<address uuid='b06ebd67-f9eb-4ab3-b62d-f5f3762b9011'/>
</source>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</hostdev>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</memballoon>
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
<address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</rng>
</devices>
</domain>
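Since the GPU reaches the VM as a mediated device (the hostdev entry above), both sides of the passthrough can be checked (a sketch; the UUID and guest PCI address are taken from the XML):
# on the KVM host: the mdev backing the vGPU should exist
ls /sys/bus/mdev/devices/ | grep b06ebd67-f9eb-4ab3-b62d-f5f3762b9011
# inside the guest: the vGPU appears as PCI device 06:00.0 (matches the nvidia-smi Bus-Id)
lspci -s 06:00.0 -nnk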
Common error checking:
- [ ] The output of nvidia-smi -a on your host
- [ ] Your docker configuration file (e.g. /etc/docker/daemon.json)
- [ ] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a
- [ ] Any relevant kernel output lines from dmesg
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- [ ] NVIDIA container library version from nvidia-container-cli -V
- [ ] NVIDIA container library logs (see troubleshooting)
What do the plugin logs look like, and what resources does your node say it has under Capacity and Allocatable when running kubectl get node?
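For reference, a couple of ways to pull just those fields (a sketch using the node name from this thread):
kubectl get node server1 -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
kubectl describe node server1 | grep -A 8 -E '^(Capacity|Allocatable):'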
Here is the output of kubectl describe node
kubectl describe node server1
Name: server1
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=server1
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.122.148/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.233.79.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 28 Nov 2022 09:16:27 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: server1
AcquireTime: <unset>
RenewTime: Mon, 28 Nov 2022 10:22:21 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 28 Nov 2022 09:17:19 +0000 Mon, 28 Nov 2022 09:17:19 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Mon, 28 Nov 2022 10:22:16 +0000 Mon, 28 Nov 2022 09:16:26 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 28 Nov 2022 10:22:16 +0000 Mon, 28 Nov 2022 09:16:26 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 28 Nov 2022 10:22:16 +0000 Mon, 28 Nov 2022 09:16:26 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 28 Nov 2022 10:22:16 +0000 Mon, 28 Nov 2022 09:18:05 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.122.148
Hostname: server1
Capacity:
cpu: 4
ephemeral-storage: 204794888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4025584Ki
pods: 110
Allocatable:
cpu: 3800m
ephemeral-storage: 188738968469
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3398896Ki
pods: 110
System Info:
Machine ID: 695b8befa78a443a950c66a055df670a
System UUID: 695b8bef-a78a-443a-950c-66a055df670a
Boot ID: e70f1479-1827-4387-b052-7e9a1a0d7211
Kernel Version: 5.4.0-132-generic
OS Image: Ubuntu 20.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.10
Kubelet Version: v1.25.4
Kube-Proxy Version: v1.25.4
PodCIDR: 10.233.64.0/24
PodCIDRs: 10.233.64.0/24
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-55b8b5b94f-bxxbw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 65m
cert-manager cert-manager-cainjector-655669b754-dd7qr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 65m
cert-manager cert-manager-webhook-77d689b6df-xq25h 0 (0%) 0 (0%) 0 (0%) 0 (0%) 65m
kube-system calico-kube-controllers-d6484b75c-b2d6v 30m (0%) 1 (26%) 64M (1%) 256M (7%) 65m
kube-system calico-node-ndjpn 150m (3%) 300m (7%) 64M (1%) 500M (14%) 65m
kube-system coredns-588bb58b94-bhs45 100m (2%) 0 (0%) 70Mi (2%) 300Mi (9%) 64m
kube-system dns-autoscaler-d8bd87bcc-65cdd 20m (0%) 0 (0%) 10Mi (0%) 0 (0%) 64m
kube-system kube-apiserver-server1 250m (6%) 0 (0%) 0 (0%) 0 (0%) 65m
kube-system kube-controller-manager-server1 200m (5%) 0 (0%) 0 (0%) 0 (0%) 65m
kube-system kube-proxy-d8std 0 (0%) 0 (0%) 0 (0%) 0 (0%) 65m
kube-system kube-scheduler-server1 100m (2%) 0 (0%) 0 (0%) 0 (0%) 65m
kube-system local-volume-provisioner-8q86f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 64m
kube-system nodelocaldns-xlx8j 100m (2%) 0 (0%) 70Mi (2%) 200Mi (6%) 64m
kube-system nvidia-device-plugin-daemonset-7989w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 54m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 950m (25%) 1300m (34%)
memory 285286400 (8%) 1280288k (36%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
Any update?
It seems that the plugin is not advertising any GPUs. Can you post the logs of the plugin?
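For example, the logs can be pulled from the DaemonSet pod visible in the node description above (a sketch; the pod name will differ on other clusters, and the label is assumed from the v0.12.3 manifest):
kubectl -n kube-system logs nvidia-device-plugin-daemonset-7989w
# or select by the DaemonSet pod label
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds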
Hi there! Was the error solved? I am facing the same error and am not able to solve it. It would be a huge help if you could help me out here.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Was the error solved?
The plugin logs requested in https://github.com/NVIDIA/k8s-device-plugin/issues/348#issuecomment-1369699003 were never supplied. @xlcbingo1999, if you are seeing similar behaviour, please provide a description of your setup as well as the plugin logs.