k8s-device-plugin
Getting GPU device minor number: Not Supported
1. Issue or feature description
helm install nvidia-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.12.2
nvidia-device-plugin-ctr logs
2022/09/06 15:24:00 Starting FS watcher.
2022/09/06 15:24:00 Starting OS watcher.
2022/09/06 15:24:00 Starting Plugins.
2022/09/06 15:24:00 Loading configuration.
2022/09/06 15:24:00 Initializing NVML.
2022/09/06 15:24:00 Updating config with default resource matching patterns.
2022/09/06 15:24:00
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "index"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
2022/09/06 15:24:00 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error building GPU Device: error getting device paths: error getting GPU device minor number: Not Supported
goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc000010a30)
/build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e5c58?, {0xc0001cc460, 0x9, 0xe}, 0x9?)
/build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001cc460, 0x9, 0xe})
/build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001cc460?)
/build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001e8820, {0xca9328?, 0xc00003a050}, {0xc000032230, 0x1, 0x1})
/build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
/build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
/build/cmd/nvidia-device-plugin/main.go:91 +0x665
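For reference, the failing NVML query can be reproduced outside the plugin. This is a minimal sketch, assuming the go-nvml bindings (github.com/NVIDIA/go-nvml); on WSL2 the minor-number query is expected to return the same "Not Supported" error that the plugin turns into the panic above:

// Sketch: query the GPU minor number via go-nvml. On WSL2 this is expected
// to fail with ERROR_NOT_SUPPORTED, matching the device plugin panic.
package main

import (
    "fmt"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        fmt.Println("failed to initialize NVML:", nvml.ErrorString(ret))
        return
    }
    defer nvml.Shutdown()

    device, ret := nvml.DeviceGetHandleByIndex(0)
    if ret != nvml.SUCCESS {
        fmt.Println("failed to get device 0:", nvml.ErrorString(ret))
        return
    }

    minor, ret := device.GetMinorNumber()
    if ret != nvml.SUCCESS {
        // On WSL2 this prints "Not Supported".
        fmt.Println("error getting GPU device minor number:", nvml.ErrorString(ret))
        return
    }
    fmt.Println("minor number:", minor)
}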
When I use ctr to run a test GPU workload, everything is OK:
ctr run --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:nbody test-gpu /tmp/nbody -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 3GB]
9216 bodies, total time for 10 iterations: 7.467 ms
= 113.747 billion interactions per second
= 2274.931 single-precision GFLOP/s at 20 flops per interaction
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of nvidia-smi -a on your host
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue Sep 6 15:30:06 2022
Driver Version : 516.94
CUDA Version : 11.7
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce GTX 1060 3GB
Product Brand : GeForce
Product Architecture : Pascal
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : N/A
GPU UUID : GPU-9445de88-eb50-477d-ff7c-5e0d77cdb203
Minor Number : N/A
VBIOS Version : 86.06.3c.00.2e
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C0210DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x11C210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 8000 KB/s
Fan Speed : 42 %
Performance State : P5
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 3072 MiB
Reserved : 84 MiB
Used : 2407 MiB
Free : 580 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 3 %
Memory : 5 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 45 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 99 C
GPU Max Operating Temp : N/A
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 12.16 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 60.00 W
Max Power Limit : 140.00 W
Clocks
Graphics : 683 MHz
SM : 683 MHz
Memory : 810 MHz
Video : 607 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 4004 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
- [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
- [ ] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Any relevant kernel output lines from dmesg
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.10.0-1 all NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
- [ ] NVIDIA container library version from
nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-container-cli list
/dev/dxg
/usr/lib/wsl/drivers/nv_dispi.inf_amd64_47917a79b8c7fd22/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libdxcore.so
containerd config (containerd.toml)
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvdia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
no_pivot = false
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
Runtime = "nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
Runtime = "nvidia-container-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = "node"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = ""
[plugins."io.containerd.grpc.v1.cri".registry.auths]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.headers]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "nvidia-container-runtime"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.
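To make the device-node point concrete: a minimal sketch, assuming the standard kubelet device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1), of how device nodes are passed when the CPU Manager workaround (passDeviceSpecs) is in effect. The helper deviceSpecsFor is illustrative, not the plugin's actual code; the point is that a native Linux node injects /dev/nvidia* nodes while WSL2 only has /dev/dxg:

// Illustrative helper: build DeviceSpec entries for a set of host device nodes.
package main

import (
    "fmt"

    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func deviceSpecsFor(paths []string) []*pluginapi.DeviceSpec {
    var specs []*pluginapi.DeviceSpec
    for _, p := range paths {
        specs = append(specs, &pluginapi.DeviceSpec{
            ContainerPath: p,
            HostPath:      p,
            Permissions:   "rw",
        })
    }
    return specs
}

func main() {
    // Native Linux: per-GPU nodes plus control nodes.
    fmt.Println(len(deviceSpecsFor([]string{"/dev/nvidia0", "/dev/nvidiactl"})))
    // WSL2: a single shared node, regardless of how many GPUs are present.
    fmt.Println(len(deviceSpecsFor([]string{"/dev/dxg"})))
}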
All right. I followed this guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl , to install CUDA on WSL. Looking at the known limitations, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps , nothing there prevents installing k8s on WSL, and running GPU workloads with the ctr command works just fine.
@elezar would you guys put this on the roadmap? Our company is running Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We are running deep learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange not to see this work in normal WSL2 containers as well.
Hi @elezar, in case it's unlikely to appear on the roadmap soon, could you please describe a rough plan for how the support should be added, and whether executing the plan would be doable by outside contributors? Thanks!
@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that:
- We detect whether this is a WSL2 system (e.g. by checking for the presence of dxcore.so.1); a rough sketch of such a check follows below.
- Modify / extend the NVML resource manager to create a device that does not require the device minor number.
Some things to note here:
- On WSL2 systems there is currently no option to select specific devices. This means that the available devices should be treated as a set and cannot be assigned to different containers.
- The device node (for use with the CPU manager workaround) on WSL2 systems is /dev/dxg and not /dev/nvidia*.
If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in.
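As referenced in the first bullet above, here is a minimal sketch of a WSL2 detection check, assuming detection by probing for the dxcore library or the /dev/dxg node (the probed paths come from the nvidia-container-cli list and CDI output elsewhere in this thread; the helper name isWSL2 is illustrative):

// Illustrative WSL2 detection: treat the node as WSL2 if the dxcore library
// or the /dev/dxg device node is present.
package main

import (
    "fmt"
    "os"
)

func isWSL2() bool {
    candidates := []string{
        "/dev/dxg",                               // WSL2 GPU paravirtualization device node
        "/usr/lib/wsl/lib/libdxcore.so",          // dxcore library as mounted by WSL
        "/usr/lib/x86_64-linux-gnu/libdxcore.so", // dxcore library as exposed in containers
    }
    for _, p := range candidates {
        if _, err := os.Stat(p); err == nil {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println("WSL2 detected:", isWSL2())
}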
Hello,
I was interested in this, so I adapted the plugin to make it work.
I pushed my version to GitLab (https://gitlab.com/Vinrobot/nvidia-kubernetes-device-plugin/-/tree/features/wsl2) and it works on my machine.
I also had to modify NVIDIA/gpu-monitoring-tools (https://github.com/Vinrobot/nvidia-gpu-monitoring-tools/tree/features/wsl2) to also use /dev/dxg.
I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?
@Vinrobot thanks for the work here. Some thoughts on this:
We recently moved away from nvidia-gpu-monitoring-tools and use bindings from go-nvml through go-nvlib instead.
I think the steps outlined in https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1262288033 should be considered as the starting point. Check if dxcore.so.1 is available and, if it is, assume a WSL2 system (one could also check for the existence of /dev/dxg here). In this case, create a wslDevice that implements the deviceInfo interface and ensure that it gets instantiated when enumerating devices. This can then return 0 for the minor number and return the correct path.
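A rough sketch of what such a wslDevice could look like. The deviceInfo interface shown here is a simplified stand-in for whatever the resource manager actually consumes (the real interface in the plugin / go-nvlib may differ); the point is only the minor-number and path behaviour:

// Simplified stand-in for the plugin's device abstraction.
package main

import "fmt"

type deviceInfo interface {
    GetUUID() (string, error)
    GetMinorNumber() (int, error)
    GetPaths() ([]string, error)
}

// wslDevice wraps an NVML-enumerated device on a WSL2 system.
type wslDevice struct {
    uuid string
}

func (d wslDevice) GetUUID() (string, error) { return d.uuid, nil }

// GetMinorNumber returns 0 because NVML reports "Not Supported" under WSL2
// and the value is not needed to build device paths there.
func (d wslDevice) GetMinorNumber() (int, error) { return 0, nil }

// GetPaths returns the single shared WSL2 device node instead of /dev/nvidia<minor>.
func (d wslDevice) GetPaths() ([]string, error) { return []string{"/dev/dxg"}, nil }

func main() {
    var d deviceInfo = wslDevice{uuid: "GPU-9445de88-eb50-477d-ff7c-5e0d77cdb203"}
    minor, _ := d.GetMinorNumber()
    paths, _ := d.GetPaths()
    fmt.Println(minor, paths)
}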
With regards to the following:
I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?
I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack.
Hi @elezar, Thanks for the feedback.
I tried to make it work with the most recent version, but I got this error (on the pod)
Warning UnexpectedAdmissionError 30s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: unsupported GPU device, which is unexpected
which is caused by this line in gpu-monitoring-tools (still used by gpuallocator).
As it's the same issue as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but that's not the goal. Anyway, I will look into it tomorrow.
@Vinrobot yes, it is an issue that gpuallocator still uses gpu-monitoring-tools. It is on our roadmap to port it to the go-nvml bindings, but this is not yet complete.
The issue is the call to get an aligned allocation here. (You can confirm this by removing this section.)
If this does work, what we would need is a mechanism to disable this for WSL2 devices.
One option would be to add an AlignedAllocationSupported() bool function to the Devices and Device types. This could look something like:
// AlignedAllocationSupported checks whether all devices support an aligned allocation
func (ds Devices) AlignedAllocationSupported() bool {
    for _, d := range ds {
        if !d.AlignedAllocationSupported() {
            return false
        }
    }
    return true
}

// AlignedAllocationSupported checks whether the device supports an aligned allocation
func (d Device) AlignedAllocationSupported() bool {
    if d.IsMigDevice() {
        return false
    }
    for _, p := range d.Paths {
        // /dev/dxg is the shared WSL2 device node; such devices cannot be
        // individually selected for an aligned allocation.
        if p == "/dev/dxg" {
            return false
        }
    }
    return true
}
(Note that this should still be discussed and could definitely be improved, but would be a good starting point).
Hi @elezar,
I'm also interested in running the device plugin with WSL2. I have created an MR https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291
Would be great to get those changes in.
Thanks @achim92 -- I will have a look at the MR.
Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2 based systems. Support for consuming this and generating a spec for available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes.
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
I0515 07:23:12.247146 1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248 1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352 1 main.go:176] Starting Plugins.
I0515 07:23:12.248389 1 main.go:234] Loading configuration.
I0515 07:23:12.248530 1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 07:23:12.248816 1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257 1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094 1 main.go:287] No devices found. Waiting indefinitely.
Thanks @elezar,
it would be even better without requiring additional device plugin changes.
I have generated the CDI spec with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml:
cdiVersion: 0.3.0
containerEdits:
hooks:
- args:
- nvidia-ctk
- hook
- create-symlinks
- --link
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi::/usr/bin/nvidia-smi
hookName: createContainer
path: /usr/bin/nvidia-ctk
- args:
- nvidia-ctk
- hook
- update-ldcache
- --folder
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4
- --folder
- /usr/lib/wsl/lib
hookName: createContainer
path: /usr/bin/nvidia-ctk
mounts:
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/lib/libdxcore.so
hostPath: /usr/lib/wsl/lib/libdxcore.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
options:
- ro
- nosuid
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
options:
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
options:
- ro
- nosuid
- nodev
- bind
devices:
- containerEdits:
deviceNodes:
- path: /dev/dxg
name: all
kind: nvidia.com/gpu
I also removed the NVIDIA Container Runtime hook under /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json.
How can I enable CDI to make it work? I'm using cri-o as container runtime, so CDI support should be enabled by default.
I0515 08:39:51.471150 1 main.go:154] Starting FS watcher.
I0515 08:39:51.471416 1 main.go:161] Starting OS watcher.
I0515 08:39:51.472727 1 main.go:176] Starting Plugins.
I0515 08:39:51.472771 1 main.go:234] Loading configuration.
I0515 08:39:51.473017 1 main.go:242] Updating config with default resource matching patterns.
I0515 08:39:51.473350 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 08:39:51.473380 1 main.go:256] Retreiving plugins.
W0515 08:39:51.473833 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0515 08:39:51.474021 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0515 08:39:51.474878 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0515 08:39:51.474918 1 factory.go:115] Incompatible platform detected
E0515 08:39:51.474925 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0515 08:39:51.474930 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0515 08:39:51.474934 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0515 08:39:51.474937 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0515 08:39:51.474946 1 main.go:287] No devices found. Waiting indefinitely.
@elezar could you please give some guidance here?
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
Hi brother, I've encountered the same issue. Have you managed to solve it?
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar ,
How can I test your changes? Do I need to create a new image and install the plugin to my k8s using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template?
Thanks
@elezar We are also interested in this
I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?
✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
WSL environment
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208
K8S Setup
> k3s --version
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8
nvidia-smi output in WSL
Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A2000 8GB Lap... On | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 3W / 40W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
##deleted processes table##
nvidia-device-plugin daemonset pod log
I0725 06:26:03.108417 1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468 1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974 1 main.go:176] Starting Plugins.
I0725 06:26:03.108995 1 main.go:234] Loading configuration.
I0725 06:26:03.109063 1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0725 06:26:03.109219 1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336 1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Test GPU pod output
Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)
Thank you @elezar . I hope this commit can be merged into this repo and published asap 🚀 !
@davidshen84 I can also confirm it works. However, we have to add some additional stuff:
$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg
Annotate the WSL node:
nvidia.com/gpu-driver-upgrade-state: pod-restart-required
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true'
nvidia.com/gpu.deploy.dcgm-exporter: 'true'
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'true'
nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
nvidia.com/gpu.deploy.node-status-exporter: 'true'
nvidia.com/gpu.deploy.nvsm: ''
nvidia.com/gpu.deploy.operands: 'true'
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/device-plugin.config: 'RTX-4070-Ti'
Change device plugin in ClusterPolicy:
devicePlugin:
config:
name: time-slicing-config
enabled: true
env:
- name: PASS_DEVICE_SPECS
value: 'true'
- name: FAIL_ON_INIT_ERROR
value: 'true'
- name: DEVICE_LIST_STRATEGY
value: envvar
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: k8s-device-plugin
imagePullPolicy: IfNotPresent
repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
version: 8b416016
It should work for now:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction
I created the "runtimeClassName" resource and added the "runtimeClassName" property to the pods.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
I did not add those properties you mentioned. Why do I need them?
Thanks
@davidshen84 Because I used the gpu-operator for automatic GPU provision
Thanks for the tip!
I verified that the staging image registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is truly working on WSL2.
Based on dockerd
Step 1, install k3s cluster based on dockerd
curl -sfL https://get.k3s.io | sh -s - --docker
Step 2, install dp with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: docker
EOF
# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvdp \
--create-namespace \
--set=runtimeClassName=nvidia \
--set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
--set=image.tag=8b416016
Based on containerd
Step 1, install k3s cluster based on containerd
curl -sfL https://get.k3s.io | sh -
Step 2, install dp with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF
# install nvdp with the same steps as above.
Test with nvdp
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
restartPolicy: Never
runtimeClassName: nvidia
containers:
- name: cuda-container
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOF
And the example cuda-sample-vectoradd works normally. Waiting for the next working release on WSL2 😃😃
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar, I saw this PR was merged in the upstream repository a long time ago. What's the plan to publish this on GitHub?
Hi @elezar,
I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working for me, even though my GPU card is a Quadro P1000. :) I can move forward to test Koordinator.
itadmin@server:~/repos/k3s-on-wsl2$ cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
itadmin@server:~/repos/k3s-on-wsl2$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Aug 16 06:21:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14 Driver Version: 528.86 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P1000 On | 00000000:01:00.0 On | N/A |
| 34% 39C P8 N/A / 47W | 1061MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl -n kube-system logs nvidia-device-plugin-daemonset-q642m
I0816 06:20:28.927429 1 main.go:154] Starting FS watcher.
I0816 06:20:28.927534 1 main.go:161] Starting OS watcher.
I0816 06:20:28.927691 1 main.go:176] Starting Plugins.
I0816 06:20:28.927698 1 main.go:234] Loading configuration.
I0816 06:20:28.927762 1 main.go:242] Updating config with default resource matching patterns.
I0816 06:20:28.927936 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0816 06:20:28.927960 1 main.go:256] Retrieving plugins.
I0816 06:20:28.930313 1 factory.go:107] Detected NVML platform: found NVML library
I0816 06:20:28.930362 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0816 06:20:28.947623 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0816 06:20:28.948059 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0816 06:20:28.949737 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
annotations:
etcd.k3s.cattle.io/node-address: 172.18.88.17
etcd.k3s.cattle.io/node-name: server-d622491e
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"52:95:ba:16:e9:29"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 172.18.88.17
k3s.io/node-args: '["server","--cluster-init","true","--etcd-expose-metrics","true","--disable","traefik","--disable-cloud-controller","true","--docker","true","--kubelet-arg","node-status-update-frequency=4s","--kube-controller-manager-arg","node-monitor-period=2s","--kube-controller-manager-arg","node-monitor-grace-period=16s","--kube-apiserver-arg","default-not-ready-toleration-seconds=20","--kube-apiserver-arg","default-unreachable-toleration-seconds=20","--write-kubeconfig","/home/itadmin/.kube/config","--private-registry","/etc/rancher/k3s/registry.yaml","--flannel-iface","eth0","--bind-address","172.18.88.17","--https-listen-port","6443","--advertise-address","172.18.88.17","--log","/var/log/k3s-server.log"]'
k3s.io/node-config-hash: IDWWDZRIJO5DHZKGYYHONVZC2DN7TK7THKPSONCFR74ST4LAGNGQ====
k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c26e7571d760c5f199d18efd197114f1ca4ab1e6ffe494f96feb65c87fcb8cf0"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2023-08-16T05:47:03Z"
finalizers:
- wrangler.cattle.io/managed-etcd-controller
- wrangler.cattle.io/node
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: server
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/etcd: "true"
node-role.kubernetes.io/master: "true"
name: server
resourceVersion: "8151"
uid: 04b6a572-830c-4102-a9a9-15265e4f6a15
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
status:
addresses:
- address: 172.18.88.17
type: InternalIP
- address: server
type: Hostname
allocatable:
cpu: "4"
ephemeral-storage: "1027046117185"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
capacity:
cpu: "4"
ephemeral-storage: 1055762868Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
conditions:
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:07Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- nvcr.io/nvidia/tensorflow@sha256:7b74f2403f62032db8205cf228052b105bd94f2871e27c1f144c5145e6072984
- nvcr.io/nvidia/tensorflow:20.03-tf2-py3
sizeBytes: 7440987700
- names:
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin@sha256:35ef4e7f7070e9ec0c9d9f9658200ce2dd61b53a436368e8ea45ec02ced78559
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
sizeBytes: 298298015
- names:
- 192.168.0.96:5000/nvidia/k8s-device-plugin@sha256:68fa1607030680a5430ee02cf4fdce040c99436d680ae24ba81ef5bbf4409e8e
- nvcr.io/nvidia/k8s-device-plugin@sha256:15c4280d13a61df703b12d1fd1b5b5eec4658157db3cb4b851d3259502310136
- 192.168.0.96:5000/nvidia/k8s-device-plugin:v0.14.1
- nvcr.io/nvidia/k8s-device-plugin:v0.14.1
sizeBytes: 298277535
- names:
- nvidia/cuda@sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
- nvidia/cuda:11.6.2-base-ubuntu20.04
sizeBytes: 153991389
- names:
- rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
- rancher/mirrored-metrics-server:v0.6.2
sizeBytes: 68892890
- names:
- rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
- rancher/mirrored-coredns-coredns:1.9.4
sizeBytes: 49802873
- names:
- rancher/local-path-provisioner@sha256:db1a3225290dd8be481a1965fc7040954d0aa0e1f86a77c92816d7c62a02ae5c
- rancher/local-path-provisioner:v0.0.23
sizeBytes: 37443889
- names:
- rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
- rancher/mirrored-pause:3.6
sizeBytes: 682696
nodeInfo:
architecture: amd64
bootID: de2732a0-17d9-4272-a205-7b9ac1103e2b
containerRuntimeVersion: docker://20.10.25
kernelVersion: 5.15.90.1-microsoft-standard-WSL2
kubeProxyVersion: v1.26.3+k3s1
kubeletVersion: v1.26.3+k3s1
machineID: 53da58bf9ac14c33847a4b6e1269419b
operatingSystem: linux
osImage: Ubuntu 22.04.3 LTS
systemUUID: 53da58bf9ac14c33847a4b6e1269419b
kind: List
metadata:
resourceVersion: ""
Tested and documented in qbo with:
- Windows 11
- WSL2
- Docker cgroup v2
- Nvidia GPU operator
- Kubeflow
https://docs.qbo.io/#/ai_and_ml?id=kubeflow
Thanks to @achim92 contrib and @elezar approval :)
Please note that on Linux the default helm chart works in qbo and kind, so there is no need for this.
This fix also works for kind kubernetes using accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml
and
extraMounts:
- hostPath: /dev/null
containerPath: /var/run/nvidia-container-devices/all
More details see here:
https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
A couple of notes for the gpu-operator
Labels
The Nvidia GPU operator requires a manual label, feature.node.kubernetes.io/pci-10de.present=true, for node-feature-discovery to add all the labels necessary for the GPU operator to work. This applies only to kind and qbo; I'm not sure why k8s requires more labels, as indicated here https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649468259
The label can be added as follows:
for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done
The reason is that WSL2 doesn't contain PCI info under /sys, so node-feature-discovery is unable to detect the GPU.
I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106
I believe node-feature-discovery is expecting something like the output below to build the 10de label (see the sketch after the lspci output for an illustration):
lspci -nn |grep -i nvidia
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] [10de:2560] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228e] (rev a1)
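As mentioned above, a simplified sketch of the kind of sysfs-based vendor scan node-feature-discovery relies on to derive pci-10de.present (paths and logic simplified; under WSL2 the glob matches nothing, so the label is never generated):

// Simplified sysfs PCI vendor scan: look for an NVIDIA (0x10de) vendor ID.
// Under WSL2 /sys/bus/pci/devices is empty, so this finds nothing.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func hasNvidiaPCIDevice() bool {
    vendorFiles, _ := filepath.Glob("/sys/bus/pci/devices/*/vendor")
    for _, f := range vendorFiles {
        b, err := os.ReadFile(f)
        if err != nil {
            continue
        }
        if strings.TrimSpace(string(b)) == "0x10de" { // NVIDIA PCI vendor ID
            return true
        }
    }
    return false
}

func main() {
    fmt.Println("NVIDIA PCI device visible in /sys:", hasNvidiaPCIDevice())
}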
I believe the right place to add this label is once the driver has been detected in the host. See here
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
I'll add my comments there.
Docker Image for device-plugin
I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but things also work with the one provided here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649010456
git branch
* device-plugin-wsl2
Docker Image for gpu-operator
I created a docker image with changes similar to this
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
Docker Image for gpu-operator-validator
Blogs on how to install: Nvidia GPU Operator + Kubeflow + Docker in Docker + cgroups v2 (In Linux and Windows WSL2)
Thank you for working on this, now that WSL2 supports systemd I think more people will be running k8s on Windows.
Can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 working on kubeadm deployed cluster with Driver Version: 551.23 and 2080ti.
Just a general note: We will release a v0.15.0-rc.1 of the GPU Device Plugin in the next week or so including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.