
Read NVIDIA_xxx environment from container, not host

dtrudg opened this issue 4 years ago

Describe the solution you'd like

The nvidia docker runtime reads the NVIDIA_xxx environment variables that control GPU setup for a container from the container's environment, not from the host. This means an image can set ENV NVIDIA_VISIBLE_DEVICES=all to trigger the runtime to pass GPUs through. It can also set NVIDIA_REQUIRE_ vars to check the CUDA versions available when it is run, etc. Docker adds a --gpus flag to override this, or to manually pass specific GPUs into a container.
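For illustration, this is roughly how it plays out on the Docker side (the image tag is just an example; the official nvidia/cuda images ship with ENV NVIDIA_VISIBLE_DEVICES=all baked into the image):

```sh
# The image's own environment drives GPU setup: because the image sets
# NVIDIA_VISIBLE_DEVICES=all, the nvidia runtime exposes all GPUs.
docker run --runtime=nvidia --rm nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

# The --gpus flag lets the host request/override GPUs explicitly.
docker run --rm --gpus device=0 nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

# Host-side env can also override what the image requested.
docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=0 \
    nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
```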

We currently read these NVIDIA_ environment variables from the host in the experimental --nvccli code, so the host, rather than the container, decides how nvidia-container-cli will be instructed.
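Concretely, with the current experimental code it is the host shell environment that matters (sketch; the SIF name is a placeholder):

```sh
# 3.9 experimental behaviour: nvidia-container-cli setup is driven by
# the host's environment, regardless of what the image's ENV says.
export NVIDIA_VISIBLE_DEVICES=0
singularity run --nvccli mycontainer.sif
```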

We should really switch to reading them out of the container environment, and implement the --gpus flag (#360) plus a SINGULARITY_GPUS env var for host-side GPU configuration overrides/options.
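Something like the following, where the syntax is hypothetical — neither --gpus nor SINGULARITY_GPUS exists yet:

```sh
# Default: the image's own NVIDIA_* env decides GPU setup.
singularity run --nvccli mycontainer.sif

# Host-side override via a --gpus flag (#360)...
singularity run --nvccli --gpus 0,1 mycontainer.sif

# ...or via a SINGULARITY_GPUS env var.
SINGULARITY_GPUS=all singularity run --nvccli mycontainer.sif
```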

dtrudg avatar Oct 11 '21 15:10 dtrudg

This is rather involved, because our environment setup within the container is performed by a shell interpreter running the action script, which in turn sources other scripts in the container that influence the final environment (sketched after the links below). It is a long way from simply reading the env vars out of an OCI runtime config.

See:

https://github.com/sylabs/singularity/blob/fba2dfa94784b872abbc42377f4bc172c6911739/internal/pkg/runtime/engine/singularity/process_linux.go#L799

https://github.com/sylabs/singularity/blob/fba2dfa94784b872abbc42377f4bc172c6911739/internal/pkg/util/fs/files/action_scripts.go#L9
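Grossly simplified, the relevant part of the action script behaves like this (a sketch, not the real script):

```sh
#!/bin/sh
# Sketch: the container environment is built dynamically by sourcing
# scripts shipped inside the image, so the final NVIDIA_* values are
# only known once this shell code has actually run in the container.
for script in /.singularity.d/env/*.sh; do
    if [ -f "$script" ]; then
        . "$script"
    fi
done
# ...followed by an exec of the container payload.
exec "$@"
```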

This feels a lot like something we'd want to address in a 4.0, with some thought about restructuring this, or capturing (some?) environment in a static config in the SIF. Otherwise we need to run a script in the container... mostly re-implementing the action script, but without the exec of the container payload.
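One shape that "script in the container" idea could take (a sketch under the assumption that GNU env is available; not working code) is an env probe that performs the same sourcing as the action script, then dumps the resolved environment for the engine to read back instead of exec'ing the payload:

```sh
#!/bin/sh
# Sketch of an env probe: same sourcing logic as the action script...
for script in /.singularity.d/env/*.sh; do
    [ -f "$script" ] && . "$script"
done
# ...but instead of exec'ing the payload, emit the resolved environment
# (NUL-separated) so the engine can parse the NVIDIA_* values from it.
env -0
```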

We can probably keep the current 'experimental' approach of using the NVIDIA_ vars from the host in 3.9, and just add the --gpus flag.

@tri-adam - would be good to have a chat about this at some point, especially thinking forward to 4.0.

dtrudg avatar Oct 11 '21 22:10 dtrudg

Closing this. We are moving toward OCI mode, and NVIDIA's CDI support.

dtrudg avatar Jul 17 '23 09:07 dtrudg