k8s-device-plugin GPU is not available with a GPU EC2 instance in EKS cluster (1.23)

GPU is not available with a GPU EC2 instance in EKS cluster (1.23)

Open garyyang6 opened this issue 1 year ago • 2 comments

1. Issue or feature description

In EKS (1.23), I launched an EC2 instance (Ubuntu) with the instance type G5.2xlarge. However, GPU is not available.

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu"

NAME                                         GPU
ip-10-2-1-197.us-west-2.compute.internal   <none>

2. Steps to reproduce the issue

I enabled GPU support by deploying the nvidia-device-plugin-daemonset kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml

Deploy a pod.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    node.kubernetes.io/instance-type: g5.2xlarge

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Tue Nov 15 01:00:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   12C    P8    14W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3. Information to attach (optional if deemed irrelevant)

Common error checking:

[ ] The output of nvidia-smi -a on your host sudo nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue Nov 15 01:06:47 2022
Driver Version                            : 510.85.02
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1321321008039
    GPU UUID                              : GPU-2600e701-8d2f-704c-06bd-ca16a9306dfe
    Minor Number                          : 0
    VBIOS Version                         : 94.02.75.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    GPU Part Number                       : 900-2G133-A840-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G133.0210.00.04
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Licensed (Expiry: N/A)
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x1E
        Domain                            : 0x0000
        Device Id                         : 0x223710DE
        Bus Id                            : 00000000:00:1E.0
        Sub System Id                     : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 296 MiB
        Used                              : 0 MiB
        Free                              : 22731 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 12 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 88 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 17.71 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1710 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 1710 MHz
        Memory                            : 6251 MHz
    Max Clocks
        Graphics                          : 1710 MHz
        SM                                : 1710 MHz
        Memory                            : 6251 MHz
        Video                             : 1500 MHz
    Max Customer Boost Clocks
        Graphics                          : 1710 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 700.000 mV
    Processes                             : None

[ ] Your docker configuration file (e.g: /etc/docker/daemon.json) sudo cat /etc/docker/daemon.json

$ sudo cat /etc/docker/daemon.json
{
   "default-runtime": "nvidia",
   "runtimes": {
       "nvidia": {
           "path": "nvidia-container-runtime",
           "runtimeArgs": []
       }
   }
}

[ ] The k8s-device-plugin container logs How to get k8s-device-plugin container logs?
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

$ sudo journalctl -r -u kubelet
-- Logs begin at Mon 2022-11-14 23:28:14 UTC, end at Tue 2022-11-15 01:12:27 UTC. --
-- No entries --

Additional information that might help better understand your environment and reproduce the bug:

[ ] Docker version from docker version

sudo docker version

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:02:21 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:04 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 nvidia:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

[ ] Docker command, image and tag used
[ ] Kernel version from uname -a

uname -a
Linux ip-10-2-1-197 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[ ] Any relevant kernel output lines from dmesg No clue what info I should provide.
[ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'
sh: 1: _or_: not found
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia         <none>       <none>       (no description available)
ii  libnvidia-container-tools     1.11.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.11.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.11.0-1     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.11.0-1     amd64        NVIDIA Container Toolkit Base
un  nvidia-docker                 <none>       <none>       (no description available)
ii  nvidia-docker2                2.11.0-1     all          nvidia-docker CLI wrapper

[ ] NVIDIA container library version from nvidia-container-cli -V

 nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$

[ ] NVIDIA container library logs (see troubleshooting)

Nov 15 '22 01:11 garyyang6

In Kubernetes 1.23 containerd is the default runtime in use. Have you configured containerd to use the nvidia-container-runtime as its default runtime?

Jan 03 '23 12:01 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

Feb 28 '24 04:02 github-actions[bot]

k8s-device-plugin k8s-device-plugin copied to clipboard

GPU is not available with a GPU EC2 instance in EKS cluster (1.23)

1. Issue or feature description

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

k8s-device-plugin
k8s-device-plugin copied to clipboard