GPU is not available with a GPU EC2 instance in EKS cluster (1.23)

Open garyyang6 opened this issue 1 year ago • 2 comments

1. Issue or feature description

In an EKS cluster (Kubernetes 1.23), I launched an Ubuntu EC2 instance of type g5.2xlarge. However, the node does not report any allocatable GPU.

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

NAME                                         GPU
ip-10-2-1-197.us-west-2.compute.internal   <none>
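
To repeat this check across several nodes, the output of the query above can be filtered. The following is only a sketch: the function name nodes_without_gpu is hypothetical, and it assumes the two-column NAME/GPU layout shown above, with the dot in nvidia.com escaped in the column expression.

```shell
# Read "NAME GPU" rows (as printed by the custom-columns query) on stdin and
# print the names of nodes whose GPU column is <none> or 0.
# The first line is skipped because it is the column header.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage (run against the cluster):
#   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" | nodes_without_gpu
```

An empty result would mean every listed node advertises at least one nvidia.com/gpu resource.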

2. Steps to reproduce the issue

I enabled GPU support by deploying the nvidia-device-plugin-daemonset:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml

Deploy a pod.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    node.kubernetes.io/instance-type: g5.2xlarge

I logged in to the Ubuntu EC2 instance and ran the following command. It shows that the instance has one GPU.

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Tue Nov 15 01:00:42 2022
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   12C    P8    14W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [ ] The output of nvidia-smi -a on your host
$ sudo nvidia-smi -a
==============NVSMI LOG==============

Timestamp                                 : Tue Nov 15 01:06:47 2022
Driver Version                            : 510.85.02
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1321321008039
    GPU UUID                              : GPU-2600e701-8d2f-704c-06bd-ca16a9306dfe
    Minor Number                          : 0
    VBIOS Version                         :
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    GPU Part Number                       : 900-2G133-A840-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G133.0210.00.04
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Licensed (Expiry: N/A)
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x1E
        Domain                            : 0x0000
        Device Id                         : 0x223710DE
        Bus Id                            : 00000000:00:1E.0
        Sub System Id                     : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 296 MiB
        Used                              : 0 MiB
        Free                              : 22731 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 12 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 88 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 17.71 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1710 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 1710 MHz
        Memory                            : 6251 MHz
    Max Clocks
        Graphics                          : 1710 MHz
        SM                                : 1710 MHz
        Memory                            : 6251 MHz
        Video                             : 1500 MHz
    Max Customer Boost Clocks
        Graphics                          : 1710 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 700.000 mV
    Processes                             : None
  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
$ sudo cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • [ ] The k8s-device-plugin container logs
How do I get the k8s-device-plugin container logs?
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
$ sudo journalctl -r -u kubelet
-- Logs begin at Mon 2022-11-14 23:28:14 UTC, end at Tue 2022-11-15 01:12:27 UTC. --
-- No entries --
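
On the question of how to get the k8s-device-plugin container logs, one possible approach is sketched below. It assumes the static v0.12.3 manifest from step 2 was applied unchanged, in which case the DaemonSet should run in the kube-system namespace with the pod label name=nvidia-device-plugin-ds. Both values are assumptions to verify with kubectl get ds -A, and the function names are hypothetical.

```shell
# Show the device plugin pods and the nodes they were scheduled on.
# Assumption: namespace kube-system and label name=nvidia-device-plugin-ds,
# as used by the static nvidia-device-plugin.yml manifest.
device_plugin_pods() {
  kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
}

# Print the last 100 log lines from every container of those pods.
device_plugin_logs() {
  kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --all-containers --tail=100
}

# Usage:
#   device_plugin_pods
#   device_plugin_logs
```

If no pod is listed for the GPU node, the DaemonSet was not scheduled there and there are no plugin logs to collect yet.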

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version

sudo docker version

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:02:21 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:04 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
uname -a
Linux ip-10-2-1-197 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • [ ] Any relevant kernel output lines from dmesg
I am not sure what information I should provide here.
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'
sh: 1: _or_: not found
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
un  libgldispatch0-nvidia         <none>       <none>       (no description available)
ii  libnvidia-container-tools     1.11.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.11.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.11.0-1     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.11.0-1     amd64        NVIDIA Container Toolkit Base
un  nvidia-docker                 <none>       <none>       (no description available)
ii  nvidia-docker2                2.11.0-1     all          nvidia-docker CLI wrapper
  • [ ] NVIDIA container library version from nvidia-container-cli -V
 nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

In Kubernetes 1.23 containerd is the default runtime in use. Have you configured containerd to use the nvidia-container-runtime as its default runtime?
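
For reference, a minimal sketch of what such a containerd setup could look like. It assumes the node's kubelet really does talk to containerd, that containerd uses config version 2 at the conventional path /etc/containerd/config.toml, and that the runtime binary is at /usr/bin/nvidia-container-runtime. None of this is confirmed by the report above, and the two function names are hypothetical helpers.

```shell
# Emit a containerd (config version 2) fragment that registers an "nvidia"
# runtime and makes it the default for the CRI plugin.
# Assumption: the binary path below matches the installed toolkit.
nvidia_containerd_fragment() {
  cat <<'EOF'
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
}

# Print the default runtime name declared in the given config file.
# Prints nothing if the file does not set default_runtime_name.
default_runtime_of() {
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\(.*\)".*$/\1/p' "$1"
}

# Usage on the node (merge by hand; do not overwrite an existing config blindly):
#   default_runtime_of /etc/containerd/config.toml
#   nvidia_containerd_fragment
#   sudo systemctl restart containerd
```

If default_runtime_of prints something other than nvidia (or nothing), containers started through containerd would not get the NVIDIA runtime, and the device plugin would likely fail to find the GPU even though docker run --gpus all works on the host.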

