Nested container can't start
1. Issue or feature description
On AWS EKS g4dn-xlarge node, inside a privileged container requesting GPU resource, a nested container failed with error:
mount "proc" to "/proc": Operation not permitted
2. Steps to reproduce the issue
- Create an EKS cluster with `g4dn-xlarge` nodes and the proper k8s labels on the nodes;
- Create a privileged Pod (can use a container image like `ubuntu:22.04`) that claims a GPU resource;
- Inside the Pod, install an OCI runtime (e.g. `apt-get install runc`);
- Prepare a minimal rootfs;
- Create an OCI spec which creates all new namespaces: user, ipc, mount, net, uts, cgroup, etc.;
- Add a "proc" mount to "/proc";
- Run a container using that OCI spec (see the sketch after this list).
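A minimal sketch of those steps, assuming `runc` is installed in the Pod and some rootfs (e.g. an exported busybox image) is available; the exact spec edits depend on your setup:

```sh
# Sketch only: run inside the privileged GPU Pod (paths and rootfs are illustrative).
mkdir -p /tmp/nested/rootfs
# Populate /tmp/nested/rootfs, e.g. by exporting a busybox image that is already available.
cd /tmp/nested
runc spec --rootless        # generates a default config.json including a user namespace
# The default spec already mounts proc at /proc; add any missing namespaces
# (ipc, uts, net, cgroup) to the "namespaces" list in config.json, then:
runc run nested-test
# Observed failure:
#   mount "proc" to "/proc": Operation not permitted
```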
To reproduce this issue, using `unshare` and `mount -N` may be simpler than writing a full OCI spec.
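For instance, something along these lines (flags are illustrative; any combination that creates new user and mount namespaces should show the same failure):

```sh
# Inside the privileged GPU Pod: create new user, mount and PID namespaces and
# try to mount a fresh proc. With the runtime's extra mounts present under
# /proc/driver/nvidia, this is expected to fail with a permission error (EPERM),
# the same root cause as the runc error above.
unshare --user --map-root-user --mount --pid --fork \
    sh -c 'mount -t proc proc /proc'
```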
3. Root cause
The cause of `mount "proc" to "/proc": Operation not permitted` is that the NVIDIA container runtime creates the following mountpoints in the outer container:
/proc/driver/nvidia/gpus/BUS/...
/proc/driver/nvidia
After unmounting these mountpoints, the nested container can be started without issue.
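A sketch of that workaround, run in the outer container before starting the nested one (the exact paths depend on the GPUs allocated to the Pod; assumes coreutils `tac` is available):

```sh
# List the mounts the runtime created under /proc/driver/nvidia ...
awk '$2 ~ "^/proc/driver/nvidia" {print $2}' /proc/mounts
# ... and unmount them, children before parents, so a nested mount namespace
# can remount proc cleanly.
awk '$2 ~ "^/proc/driver/nvidia" {print $2}' /proc/mounts | tac | xargs -r -n1 umount
```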
4. Thoughts
Not sure why the NVIDIA container runtime creates mountpoints under "/proc". Based on observation, without the mountpoints, files like /proc/driver/nvidia/gpus/... and /proc/driver/nvidia are still visible and accessible to the Pod. Is that for isolation purposes, so that when there are multiple GPU devices on the system the Pod only sees the devices allocated to it?
We also experimented on GKE, which doesn't have this issue. We don't see the mountpoints on /proc on GKE.
The NVIDIA Container CLI ensures that only the proc paths for devices requested are mounted into the container. The /proc/driver/nvidia/params file is also updated to ensure that tools such as nvidia-smi don't create the device nodes for devices not requested.
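For reference, this can be observed from inside a GPU Pod, e.g. (a sketch; the exact set of mounts depends on the devices requested):

```sh
# Show which per-device proc paths the CLI mounted for this Pod ...
grep ' /proc/driver/nvidia' /proc/mounts
# ... and the overridden params file; ModifyDeviceFiles is expected to be 0 here,
# so nvidia-smi does not create device nodes for devices that were not requested.
grep ModifyDeviceFiles /proc/driver/nvidia/params
```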
Since you mention GKE: did you install the NVIDIA Container Runtime there, or are you launching a pod using their device plugin?
Thanks @elezar for the explanation!
Regarding GKE, we followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus; we didn't dig deeper into what's configured on the VMs, and we didn't do anything specific on them.
@easeway the default GKE installation does not use the NVIDIA Container Toolkit, which would explain the different experience there. We are working on aligning things better across the cloud providers, including better support for nested containers.
@elezar Thanks! I'm looking forward to it!