k8s-device-plugin
GPU is not available with a GPU EC2 instance in EKS cluster (1.23)
1. Issue or feature description
In an EKS cluster (1.23), I launched an EC2 instance (Ubuntu) with the instance type g5.2xlarge. However, the GPU is not available to Kubernetes:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu"
NAME GPU
ip-10-2-1-197.us-west-2.compute.internal <none>
2. Steps to reproduce the issue
I enabled GPU support by deploying the nvidia-device-plugin daemonset:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
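As a sanity check before deploying workloads, I would first verify the plugin pod actually started on the GPU node. The commands below are illustrative and assume the v0.12.3 manifest, which runs the daemonset in kube-system with the pod label name=nvidia-device-plugin-ds:

```shell
# Check that the daemonset has the desired number of pods scheduled
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset

# Confirm a plugin pod is Running on the GPU node (note the NODE column)
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
```

If no pod lands on the GPU node at all, check for taints on the node that the daemonset does not tolerate.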
Then I deployed a pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    node.kubernetes.io/instance-type: g5.2xlarge
I logged in to the Ubuntu EC2 instance and ran the following command. It shows that the instance has one GPU.
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Tue Nov 15 01:00:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 12C P8 14W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of nvidia-smi -a on your host:
$ sudo nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue Nov 15 01:06:47 2022
Driver Version : 510.85.02
CUDA Version : 11.6
Attached GPUs : 1
GPU 00000000:00:1E.0
Product Name : NVIDIA A10G
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1321321008039
GPU UUID : GPU-2600e701-8d2f-704c-06bd-ca16a9306dfe
Minor Number : 0
VBIOS Version : 94.02.75.00.01
MultiGPU Board : No
Board ID : 0x1e
GPU Part Number : 900-2G133-A840-000
Module ID : 0
Inforom Version
Image Version : G133.0210.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
vGPU Software Licensed Product
Product Name : NVIDIA RTX Virtual Workstation
License Status : Licensed (Expiry: N/A)
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x1E
Domain : 0x0000
Device Id : 0x223710DE
Bus Id : 00000000:00:1E.0
Sub System Id : 0x152F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 23028 MiB
Reserved : 296 MiB
Used : 0 MiB
Free : 22731 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 1 MiB
Free : 32767 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 12 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 17.71 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1710 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1710 MHz
Memory : 6251 MHz
Max Clocks
Graphics : 1710 MHz
SM : 1710 MHz
Memory : 6251 MHz
Video : 1500 MHz
Max Customer Boost Clocks
Graphics : 1710 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 700.000 mV
Processes : None
- [ ] Your docker configuration file (e.g. /etc/docker/daemon.json):
$ sudo cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
- [ ] The k8s-device-plugin container logs. (How do I get the k8s-device-plugin container logs?)
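For reference, the plugin's logs can be pulled with kubectl. This sketch assumes the daemonset from the v0.12.3 manifest, which labels its pods name=nvidia-device-plugin-ds and runs in kube-system:

```shell
# List the device-plugin pods and the nodes they run on
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide

# Dump the logs of all device-plugin pods (add --previous if a pod restarted)
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
```

A healthy plugin logs that it registered the nvidia.com/gpu resource with the kubelet; errors about the runtime or NVML usually point at a container-runtime misconfiguration.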
- [ ] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet):
$ sudo journalctl -r -u kubelet
-- Logs begin at Mon 2022-11-14 23:28:14 UTC, end at Tue 2022-11-15 01:12:27 UTC. --
-- No entries --
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version:
$ sudo docker version
Client: Docker Engine - Community
Version: 20.10.21
API version: 1.41
Go version: go1.18.7
Git commit: baeda1f
Built: Tue Oct 25 18:02:21 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.21
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: 3056208
Built: Tue Oct 25 18:00:04 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.9
GitCommit: 1c90a442489720eec95342e1789ee8a5e1b9536f
nvidia:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a:
$ uname -a
Linux ip-10-2-1-197 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- [ ] Any relevant kernel output lines from dmesg
No clue what info I should provide.
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*':
$ dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'
sh: 1: _or_: not found
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.11.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.11.0-1 amd64 NVIDIA Container Toolkit Base
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
- [ ] NVIDIA container library version from nvidia-container-cli -V:
$ nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
In Kubernetes 1.23, containerd is the default runtime in use. Have you configured containerd to use the nvidia-container-runtime as its default runtime?
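If the kubelet talks to containerd rather than Docker, the /etc/docker/daemon.json shown above has no effect on it. A minimal sketch of the relevant /etc/containerd/config.toml section that makes nvidia the default runtime, following the NVIDIA Container Toolkit documentation (the BinaryName path assumes the toolkit's default install location):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

After editing the file, restart containerd (e.g. sudo systemctl restart containerd) and then check whether nvidia.com/gpu shows up under the node's allocatable resources.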