Nested container can't start
1. Issue or feature description
On AWS EKS g4dn-xlarge node, inside a privileged container requesting GPU resource, a nested container failed with error:
mount "proc" to "/proc": Operation not permitted
2. Steps to reproduce the issue
- Create an EKS cluster with `g4dn-xlarge` nodes and the proper k8s labels on the nodes;
- Create a privileged Pod (can use a container image like `ubuntu:22.04`) that claims a GPU resource;
- Inside the Pod, install an OCI runtime (e.g. `apt-get install runc`);
- Prepare a minimal rootfs;
- Create an OCI spec which creates all new namespaces: user, ipc, mount, net, uts, cgroup, etc.;
- Add a "proc" mount to "/proc";
- Run a container using that OCI spec (see the sketch after this list).
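A minimal sketch of those steps, assuming `runc` is installed in the Pod and some rootfs (e.g. an exported busybox image) is available; the exact spec edits depend on your setup:

```sh
# Sketch only: run inside the privileged GPU Pod (paths and rootfs are illustrative).
mkdir -p /tmp/nested/rootfs
# Populate /tmp/nested/rootfs, e.g. by exporting a busybox image that is already available.
cd /tmp/nested
runc spec --rootless        # generates a default config.json including a user namespace
# The default spec already mounts proc at /proc; add any missing namespaces
# (ipc, uts, net, cgroup) to the "namespaces" list in config.json, then:
runc run nested-test
# Observed failure:
#   mount "proc" to "/proc": Operation not permitted
```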
To reproduce this issue, using `unshare` and `mount -N` may be simpler than writing a full OCI spec.
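For instance, something along these lines (flags are illustrative; any combination that creates new user and mount namespaces should show the same failure):

```sh
# Inside the privileged GPU Pod: create new user, mount and PID namespaces and
# try to mount a fresh proc. With the runtime's extra mounts present under
# /proc/driver/nvidia, this is expected to fail with a permission error (EPERM),
# the same root cause as the runc error above.
unshare --user --map-root-user --mount --pid --fork \
    sh -c 'mount -t proc proc /proc'
```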
3. Root cause
The cause of `mount "proc" to "/proc": Operation not permitted` is that the NVIDIA container runtime creates the following mountpoints in the outer container:
/proc/driver/nvidia/gpus/BUS/...
/proc/driver/nvidia
After unmounting these mountpoints, the nested container can be started without issue.
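A sketch of that workaround, run in the outer container before starting the nested one (the exact paths depend on the GPUs allocated to the Pod; assumes coreutils `tac` is available):

```sh
# List the mounts the runtime created under /proc/driver/nvidia ...
awk '$2 ~ "^/proc/driver/nvidia" {print $2}' /proc/mounts
# ... and unmount them, children before parents, so a nested mount namespace
# can remount proc cleanly.
awk '$2 ~ "^/proc/driver/nvidia" {print $2}' /proc/mounts | tac | xargs -r -n1 umount
```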
4. Thoughts
Not sure why the NVIDIA container runtime creates mountpoints under "/proc". Based on observation, without the mountpoints, files like /proc/driver/nvidia/gpus/... and /proc/driver/nvidia are still visible and accessible to the Pod. Is that for isolation purposes, so that when there are multiple GPU devices on the system the Pod only sees the devices allocated to it?
We also experimented on GKE, which doesn't have this issue. We don't see the mountpoints on /proc on GKE.
The NVIDIA Container CLI ensures that only the proc paths for devices requested are mounted into the container. The /proc/driver/nvidia/params file is also updated to ensure that tools such as nvidia-smi don't create the device nodes for devices not requested.
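For reference, this can be observed from inside a GPU Pod, e.g. (a sketch; the exact set of mounts depends on the devices requested):

```sh
# Show which per-device proc paths the CLI mounted for this Pod ...
grep ' /proc/driver/nvidia' /proc/mounts
# ... and the overridden params file; ModifyDeviceFiles is expected to be 0 here,
# so nvidia-smi does not create device nodes for devices that were not requested.
grep ModifyDeviceFiles /proc/driver/nvidia/params
```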
Since you mention GKE: did you install the NVIDIA Container Runtime there, or are you launching a pod using their device plugin?
Thanks @elezar for the explanation!
Regarding GKE, we followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus; we didn't dig deeper into what's configured on the VMs, and we didn't do anything specific on them.
@easeway the default GKE installation does not use the NVIDIA Container Toolkit, which would explain the different experience there. We are working on aligning things better across the cloud providers, including better support for nested containers.
@elezar Thanks! I'm looking forward to it!