
Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Open somethingwentwell opened this issue 2 years ago • 9 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

 kubectl describe po gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9cp5g (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-9cp5g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
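The event above means no node currently advertises the requested `nvidia.com/gpu` resource. A quick way to confirm what each node actually advertises (a sketch; assumes `kubectl` access to the cluster, and guarded so it is a no-op elsewhere):

```shell
# List each node with its allocatable nvidia.com/gpu count; an empty GPU
# column matches the "Insufficient nvidia.com/gpu" scheduling error.
# Note the escaped dots in the JSONPath-style key.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
else
  echo "kubectl not found; run this where the cluster is reachable"
fi
gpu_column_rc=$?
```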

2. Steps to reproduce the issue

The VM runs Ubuntu 20.04.

  1. Install the 470.141.03 driver. Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-20C      On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1589MiB / 20475MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

  2. Deploy a single-node Kubernetes cluster using Kubespray. Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-11T02:46:24Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
  3. Install the NVIDIA Container Toolkit:
nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
  4. Edit the containerd config:
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  level = "info"

[metrics]
  address = ""
  grpc_histogram = false

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.k8s.io/pause:3.7"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          base_runtime_spec = "/etc/containerd/cri-base.json"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
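A single TOML typo in this file (an unbalanced quote, a misspelled key) can make containerd silently ignore the nvidia runtime and fall back to runc, which in turn leaves the device plugin unable to see the GPU. A sketch of a post-edit sanity check, assuming containerd 1.5+ (which provides the `config dump` subcommand); guarded so it is a no-op where containerd is absent:

```shell
# Dump the merged configuration and confirm the nvidia runtime is both
# defined and set as the default runtime.
if command -v containerd >/dev/null 2>&1; then
  containerd config dump | grep -E 'default_runtime_name|BinaryName'
  # then apply the change: sudo systemctl restart containerd
else
  echo "containerd not found"
fi
config_check_rc=$?
```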
  5. Enable GPU support in Kubernetes:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
  6. Run a GPU job:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
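If the pod created above stays Pending, the plugin's own logs are the first thing to check. A sketch of the verification sequence (assumes `kubectl` access and that the v0.12.3 manifest's `name=nvidia-device-plugin-ds` pod label is in use; guarded so it is a no-op elsewhere):

```shell
if command -v kubectl >/dev/null 2>&1; then
  # The plugin pod should be Running on the GPU node.
  kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
  # A healthy log ends with a line like:
  #   Registered device plugin for 'nvidia.com/gpu' with Kubelet
  # NVML or runtime errors instead point at the driver or containerd setup.
  kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=50
else
  echo "kubectl not found"
fi
plugin_check_rc=$?
```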

3. Information to attach (optional if deemed irrelevant)

  1. containerd version
containerd --version
containerd github.com/containerd/containerd v1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
  2. KVM config:
<domain type='kvm'>
  <name>test-vm1</name>
  <uuid>695b8bef-a78a-443a-950c-66a055df670a</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/doma>
      <libosinfo:os id="http://ubuntu.com/ubuntu/20.04"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model' check='partial'/>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/test-disk1.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0' m>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x2'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' m>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6e:b1:69'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='>
      <source>
        <address uuid='b06ebd67-f9eb-4ab3-b62d-f5f3762b9011'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </rng>
  </devices>
</domain>


Common error checking:

  • [ ] The output of nvidia-smi -a on your host
  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [ ] The k8s-device-plugin container logs
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)

somethingwentwell avatar Nov 28 '22 09:11 somethingwentwell

What do the plugin logs look like, and what resources does your node say it has under Capacity and Allocatable when running kubectl get node?

klueska avatar Nov 28 '22 09:11 klueska

Here is the output of kubectl describe node

kubectl describe node server1
Name:               server1
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=server1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.122.148/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.233.79.64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 28 Nov 2022 09:16:27 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  server1
  AcquireTime:     <unset>
  RenewTime:       Mon, 28 Nov 2022 10:22:21 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 28 Nov 2022 09:17:19 +0000   Mon, 28 Nov 2022 09:17:19 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:18:05 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.122.148
  Hostname:    server1
Capacity:
  cpu:                4
  ephemeral-storage:  204794888Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025584Ki
  pods:               110
Allocatable:
  cpu:                3800m
  ephemeral-storage:  188738968469
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3398896Ki
  pods:               110
System Info:
  Machine ID:                 695b8befa78a443a950c66a055df670a
  System UUID:                695b8bef-a78a-443a-950c-66a055df670a
  Boot ID:                    e70f1479-1827-4387-b052-7e9a1a0d7211
  Kernel Version:             5.4.0-132-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.10
  Kubelet Version:            v1.25.4
  Kube-Proxy Version:         v1.25.4
PodCIDR:                      10.233.64.0/24
PodCIDRs:                     10.233.64.0/24
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  cert-manager                cert-manager-55b8b5b94f-bxxbw               0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager                cert-manager-cainjector-655669b754-dd7qr    0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager                cert-manager-webhook-77d689b6df-xq25h       0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 calico-kube-controllers-d6484b75c-b2d6v     30m (0%)      1 (26%)     64M (1%)         256M (7%)      65m
  kube-system                 calico-node-ndjpn                           150m (3%)     300m (7%)   64M (1%)         500M (14%)     65m
  kube-system                 coredns-588bb58b94-bhs45                    100m (2%)     0 (0%)      70Mi (2%)        300Mi (9%)     64m
  kube-system                 dns-autoscaler-d8bd87bcc-65cdd              20m (0%)      0 (0%)      10Mi (0%)        0 (0%)         64m
  kube-system                 kube-apiserver-server1                      250m (6%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-controller-manager-server1             200m (5%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-proxy-d8std                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-scheduler-server1                      100m (2%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 local-volume-provisioner-8q86f              0 (0%)        0 (0%)      0 (0%)           0 (0%)         64m
  kube-system                 nodelocaldns-xlx8j                          100m (2%)     0 (0%)      70Mi (2%)        200Mi (6%)     64m
  kube-system                 nvidia-device-plugin-daemonset-7989w        0 (0%)        0 (0%)      0 (0%)           0 (0%)         54m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                950m (25%)      1300m (34%)
  memory             285286400 (8%)  1280288k (36%)
  ephemeral-storage  0 (0%)          0 (0%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)
Events:              <none>

somethingwentwell avatar Nov 28 '22 10:11 somethingwentwell

Any update?

somethingwentwell avatar Dec 28 '22 04:12 somethingwentwell

It seems that the plugin is not advertising any GPUs. Can you post the logs of the plugin?

klueska avatar Jan 03 '23 12:01 klueska

Hi there! Was this error solved? I am facing the same error and have not been able to resolve it. Any help here would be hugely appreciated.

Todoroki02 avatar Aug 25 '23 16:08 Todoroki02

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]

Was the error solved?

xlcbingo1999 avatar Mar 03 '24 07:03 xlcbingo1999

The plugin logs requested in https://github.com/NVIDIA/k8s-device-plugin/issues/348#issuecomment-1369699003 were never supplied. @xlcbingo1999, if you are seeing similar behaviour, please provide a description of your setup as well as the plugin logs.

elezar avatar Mar 04 '24 07:03 elezar