Running runsc with containerd and `--nvproxy=true` removes NVIDIA drivers from container in Kubernetes
Description
Hello. I'm trying to get gVisor to work with NVIDIA drivers in Kubernetes, using the regular AWS EKS Amazon Linux 2 AMI (not the GPU one). I can confirm that both work separately; however, I'm having a lot of trouble getting gVisor to work with the NVIDIA drivers. When I try to run the nvidia/cuda image using the gVisor runtime class, I can see that the environment variables are set correctly, but the nvidia-smi binary is missing. These are all the files I'm using:
config.toml
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[grpc]
  address = "/run/containerd/containerd.sock"

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      discard_unpacked_layers = true
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
          runtime_type = "io.containerd.runsc.v1"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
            TypeUrl = "io.containerd.runsc.v1.options"
            ConfigPath = "/etc/containerd/runsc.toml"
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
runsc.toml
log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
[runsc_config]
nvproxy = "true"
debug = "true"
debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"
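For context on how this file is consumed: the shim turns each key under [runsc_config] into a runsc command-line flag, so the configuration above should be roughly equivalent to invoking runsc like this (illustrative sketch only; the real invocation also includes shim-provided arguments such as the bundle and container ID):

```shell
# Illustrative sketch of what the shim ends up running per sandbox; the
# %ID%/%COMMAND% placeholders are expanded when the log files are created.
runsc \
  --nvproxy=true \
  --debug=true \
  --debug-log=/var/log/runsc/%ID%/gvisor.%COMMAND%.log \
  <subcommand>
```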
Test pod:
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  runtimeClassName: gvisor
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.test.io/gpu
            operator: Exists
  tolerations:
  - key: node-role.test.io/gpu
    operator: Exists
    effect: NoSchedule
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["tail", "-f", "/dev/null"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
Exec'ing into the pod:
❯ k exec -it nvidia-version-check -- bash
root@nvidia-version-check:/# env | grep NVIDIA
NVIDIA_VISIBLE_DEVICES=GPU-873dadb3-e07f-436d-abc6-4bcea3b3a9e2
NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419
NVIDIA_DRIVER_CAPABILITIES=compute,utility
root@nvidia-version-check:/# dmesg
[ 0.000000] Starting gVisor...
[ 0.436519] Generating random numbers by fair dice roll...
[ 0.723402] Checking naughty and nice process list...
[ 0.822312] Creating cloned children...
[ 0.918708] Committing treasure map to memory...
[ 1.208497] Daemonizing children...
[ 1.504345] Mounting deweydecimalfs...
[ 1.944738] Creating bureaucratic processes...
[ 1.948269] Constructing home...
[ 2.122704] Synthesizing system calls...
[ 2.155253] Searching for needles in stacks...
[ 2.579675] Setting up VFS...
[ 2.866641] Setting up FUSE...
[ 3.103859] Ready!
root@nvidia-version-check:/# ls /usr/local/cuda
compat lib64 targets
root@nvidia-version-check:/# which nvidia-smi
I have the NVIDIA Plugin DaemonSet running using the nvidia runtime class.
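For completeness, a quick way to see what actually made it into the container is to check for the NVIDIA device nodes and user-space driver files from inside the pod (illustrative commands, not part of the session above):

```shell
# Illustrative checks: were the NVIDIA device nodes and user-space driver
# files injected into the container at all?
ls -l /dev/nvidia* 2>/dev/null || echo "no NVIDIA device nodes"
find / -name 'libnvidia-ml.so*' -o -name nvidia-smi 2>/dev/null
```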
Steps to reproduce
# Install NVIDIA drivers and container toolkit
sudo yum install -y gcc kernel-devel-$(uname -r)
DRIVER_VERSION=525.60.13
curl -fsSL -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run"
chmod +x NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run --silent
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
# Install runsc and containerd-shim-runsc-v1
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/latest/$ARCH
wget $URL/runsc $URL/containerd-shim-runsc-v1
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/bin
# Update `/etc/containerd/config.toml` to match the one above
# Update `/etc/containerd/runsc.toml` to match the one above
# Restart containerd
sudo systemctl restart containerd
# Deploy the NVIDIA Plugin daemonset
# (update the affinity to only be scheduled to nodes with GPUs)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Create the runtime classes
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF
# Create the above pod and exec to it
### runsc version
runsc version release-20230904.0
spec: 1.1.0-rc.1

### docker version (if using docker)
not using docker

### uname
Linux ip-10-253-32-249.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

### kubectl (if using Kubernetes)
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:47Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

### repo state (if built from source)
_No response_

### runsc debug logs (if available)
_No response_
Adding the logs for the container: logs.zip
Thanks for the very detailed report! Apologies for the delay. nvproxy is not supported with k8s-device-plugin yet, and we haven't investigated what needs to be done to add support. We would appreciate OSS contributions!
We are currently focused on establishing support in GKE. GKE uses a different GPU+container stack. It does not use k8s-device-plugin. It instead has its own device plugin: https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu. This configures the container in a different way. nvproxy in GKE is still experimental, but it works! Please let me know if you want to experiment on GKE, and we can provide more detailed instructions.
To summarize, nvproxy works in the following environments:
- Docker: `docker run --gpus=...`. Needs the `--nvproxy-docker` flag.
- nvidia-container-runtime with legacy mode. Needs the `--nvproxy-docker` flag.
- GKE. Does not need the `--nvproxy-docker` flag.
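For the Docker case, the setup being referred to should look roughly like the sketch below (based on my reading of the gVisor GPU docs, so treat the exact flag names as assumptions to verify against your runsc version):

```shell
# Sketch: register runsc as a Docker runtime with nvproxy enabled
# (flag names assumed from the gVisor GPU docs).
sudo runsc install -- --nvproxy=true --nvproxy-docker=true
sudo systemctl restart docker

# GPUs are then requested the usual Docker way:
docker run --rm --runtime=runsc --gpus=all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```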
Thanks for the follow-up @ayushr2. In the meantime I've made some progress: by just using nvproxy, bootstrapping the host node with the NVIDIA driver, and then mounting the driver into the container with a hostPath volume, I can run nvidia-smi successfully. However, it seems it can't fully access the GPU:
==============NVSMI LOG==============
Timestamp : Mon Oct 30 15:53:01 2023
Driver Version : 525.60.13
CUDA Version : 12.0
Attached GPUs : 1
GPU 00000000:00:1E.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : GPU access blocked by the operating system
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : GPU access blocked by the operating system
GPU UUID : GPU-3ec3e89a-b2ec-68d1-bb38-3becc2cf55cd
Minor Number : 0
VBIOS Version : Unknown Error
MultiGPU Board : No
Board ID : 0x1e
Board Part Number : GPU access blocked by the operating system
GPU Part Number : GPU access blocked by the operating system
Module ID : GPU access blocked by the operating system
Inforom Version
Image Version : GPU access blocked by the operating system
OEM Object : Unknown Error
ECC Object : GPU access blocked by the operating system
Power Management Object : Unknown Error
GPU Operation Mode
Current : GPU access blocked by the operating system
Pending : GPU access blocked by the operating system
GSP Firmware Version : 525.60.13
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x1E
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:00:1E.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : Unknown Error
Current : Unknown Error
Device Current : Unknown Error
Device Max : Unknown Error
Host Max : Unknown Error
Link Width
Max : Unknown Error
Current : Unknown Error
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : GPU access blocked by the operating system
Replay Number Rollovers : GPU access blocked by the operating system
Tx Throughput : GPU access blocked by the operating system
Rx Throughput : GPU access blocked by the operating system
Atomic Caps Inbound : GPU access blocked by the operating system
Atomic Caps Outbound : GPU access blocked by the operating system
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15360 MiB
Reserved : 399 MiB
Used : 2 MiB
Free : 14957 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : GPU access blocked by the operating system
Average FPS : GPU access blocked by the operating system
Average Latency : GPU access blocked by the operating system
FBC Stats
Active Sessions : GPU access blocked by the operating system
Average FPS : GPU access blocked by the operating system
Average Latency : GPU access blocked by the operating system
Ecc Mode
Current : GPU access blocked by the operating system
Pending : GPU access blocked by the operating system
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : GPU access blocked by the operating system
Double Bit ECC : GPU access blocked by the operating system
Pending Page Blacklist : GPU access blocked by the operating system
Remapped Rows : GPU access blocked by the operating system
Temperature
GPU Current Temp : 22 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 13.16 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 1590 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None
I also tried running this under runtimeClass: nvidia and this didn't happen, so it's definitely a gVisor issue. Unfortunately, GKE is not viable for our use case. I'll try the options you described to see if I can get it working.
> However, it seems it can't fully access the GPU
Yeah, I don't think it will work just yet. In GKE, the container spec defines which GPUs to expose in spec.Linux.Devices. However, in the boot logs you attached above, I could not see any such devices defined, so gVisor will not expose any devices.
My best guess is that k8s-device-plugin is creating bind mounts of /dev/nvidia* devices in the container's root filesystem and then expecting the container to be able to access that. That won't work with gVisor with any combination of our --nvproxy flags, because even though the devices exist on the host filesystem, they don't exist in our sentry's /dev filesystem (which is an in-memory filesystem).
In docker mode, the GPU devices are explicitly exposed like this. In GKE, the device files are automatically created here because spec.Linux.Devices defines it. So you could look into adding similar support for k8s-device-plugin environment.
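One way to check this theory on the EKS node is to look at the OCI spec that containerd hands to the runtime and see whether linux.devices contains the NVIDIA nodes. A rough sketch (the bundle path assumes containerd's usual runtime-v2 layout for the k8s.io namespace, and jq must be installed on the node):

```shell
# Dump the linux.devices section of each generated OCI spec on the node
# (path assumes containerd's runtime-v2 bundle layout; adjust if yours differs).
sudo find /run/containerd/io.containerd.runtime.v2.task/k8s.io -maxdepth 2 -name config.json \
  -exec sh -c 'echo "== $1"; jq ".linux.devices" "$1"' _ {} \;
```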
Thanks for the detailed reply @ayushr2! Though I'm a bit out of my depth here, your guidance has been very helpful. I'm trying to better understand the differences for GKE; could you please point me to where the container spec/sandbox is defined? I'm not sure if it's possible to try to port that configuration over to Amazon Linux or if I should just try to add the feature directly to the gVisor code you pointed me to.
I've very naively tried to add the following snippet to runsc/boot/vfs.go:createDeviceFiles:
mode := os.FileMode(int(0777))
info.spec.Linux.Devices = append(info.spec.Linux.Devices, []specs.LinuxDevice{
    {
        Path:     "/dev/nvidia0",
        Type:     "c",
        Major:    195,
        Minor:    0,
        FileMode: &mode,
    },
    {
        Path:     "/dev/nvidia-modeset",
        Type:     "c",
        Major:    195,
        Minor:    254,
        FileMode: &mode,
    },
    {
        Path:     "/dev/nvidia-uvm",
        Type:     "c",
        Major:    245,
        Minor:    0,
        FileMode: &mode,
    },
    {
        Path:     "/dev/nvidia-uvm-tools",
        Type:     "c",
        Major:    245,
        Minor:    1,
        FileMode: &mode,
    },
}...)
in order to try to mount the devices at runtime, but it seems like even this isn't enough.
You probably also want /dev/nvidiactl. You basically want to call this. Usually that is only called for --nvproxy-docker. JUST FOR TESTING, try adding a new flag --nvproxy-k8s and change the condition on line 1221 to be `if info.conf.NVProxyDocker || info.conf.NVProxyK8s { ...`
Also note that the minor number of /dev/nvidia-uvm is different inside the sandbox. So just copying from host won't work.
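If anyone wants to experiment with that suggestion, the build-and-swap loop would look roughly like this (the make invocation is my assumption from the gVisor build docs; check the repo's build instructions):

```shell
# Rough sketch of rebuilding runsc with the experimental flag change applied
# (build command assumed from the gVisor build docs).
git clone https://github.com/google/gvisor.git && cd gvisor
# ...edit runsc/boot/vfs.go and the flag definitions as suggested above...
mkdir -p bin
make copy TARGETS=runsc DESTINATION=bin/   # builds runsc via Bazel and copies it out
sudo install -m 0755 bin/runsc /usr/bin/runsc
sudo systemctl restart containerd
```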
Yeah, from reading the code and looking at the logs, it seems like gVisor automatically assigns a minor number to the device. Unfortunately, your suggestion still didn't work. I'll leave the logs for the container here in case you (or anyone who comes across this issue) want to use them for debugging (note that I had already added an nvproxy-automount-dev flag for the same purpose as the nvproxy-k8s flag you suggested).
runsc.tar.gz
Got it, thanks for working with me on this.
Just to set expectations, adding support for k8s-device-plugin is currently not on our roadmap. We are focused on maturing GPU support in GKE first. OSS contributions for GPU support in additional environments are appreciated in the meantime!
No worries! In the meantime, we don't have a strict requirement for NVIDIA working with gVisor, so we can work around it. I'd love to help bring in this feature, though I'd still need to get more familiar with gVisor; I'll help in any way I can!
A friendly reminder that this issue had no activity for 120 days.
@PedroRibeiro95 Have you done any new research on this? I was looking into it and it looks like it should work with the following configuration:
k8s-device-plugin has a config option called DEVICE_LIST_STRATEGY, which allows the device list to be returned as CDI. Once kubelet receives the allocate response from the device plugin, it should populate the CDI spec file and start the container (assuming we are just using containerd). containerd will then parse the CDI devices, convert them into devices in the OCI spec file, and pass the spec to runc or runsc. runsc should then just create the Linux devices here, as @ayushr2 described. (I am assuming that in this case the nvidia runtime is not needed, since we don't need the prestart hook?)
I never tested any of this, and everything mentioned above is just a guess on my part, but let me know whether my reasoning makes sense.
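To make that flow concrete, the node-side pieces that would need to line up look roughly like this (the strategy value, DaemonSet name, and containerd option names are my assumptions, so verify them against the k8s-device-plugin and containerd docs):

```shell
# 1. Ask the device plugin to advertise devices via CDI instead of env vars
#    (exact strategy value and DaemonSet name are assumptions).
kubectl -n kube-system set env daemonset/nvidia-device-plugin-daemonset \
  DEVICE_LIST_STRATEGY=cdi-annotations

# 2. containerd must have CDI enabled so CDI device names are resolved into
#    linux.devices entries in the OCI spec (containerd >= 1.7; option names assumed):
#      [plugins."io.containerd.grpc.v1.cri"]
#        enable_cdi = true
#        cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

# 3. Confirm CDI spec files exist on the node:
ls /etc/cdi /var/run/cdi 2>/dev/null
```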
Hey @sfc-gh-hyu, thanks for the detailed instructions. I haven't revisited this in the meantime as other priorities came up, but I will be testing it again very soon. I will try to follow what you suggested and I will report back with more details.