singularity
Read NVIDIA_xxx environment from container, not host
Describe the solution you'd like
The NVIDIA Docker runtime reads the NVIDIA_xxx environment variables that control GPU setup from the container's environment, not the host's. This means a container image can set ENV NVIDIA_VISIBLE_DEVICES=all to make the runtime pass in GPUs. It can also set NVIDIA_REQUIRE_ variables to check, for example, which CUDA versions are available when it is run. A --gpus flag is also available to override these settings, or to manually pass a GPU into a container.
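To make the container-side behaviour concrete, here is a minimal Go sketch of picking NVIDIA_ variables out of a "KEY=VALUE" environment list, the form an OCI runtime config uses for the process environment. The function name `nvidiaEnv` and the sample values are hypothetical, chosen for illustration; this is not the NVIDIA runtime's code or anything in Singularity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nvidiaEnv returns the NVIDIA_* variables found in a "KEY=VALUE"
// environment list. Entries without "=" are skipped, and a later
// entry for the same key overrides an earlier one.
// (Hypothetical helper, for illustration only.)
func nvidiaEnv(env []string) map[string]string {
	out := make(map[string]string)
	for _, kv := range env {
		key, val, ok := strings.Cut(kv, "=")
		if !ok || !strings.HasPrefix(key, "NVIDIA_") {
			continue
		}
		out[key] = val
	}
	return out
}

func main() {
	// Sample container environment; the values are assumptions.
	containerEnv := []string{
		"PATH=/usr/local/bin:/usr/bin",
		"NVIDIA_VISIBLE_DEVICES=all",
		"NVIDIA_REQUIRE_CUDA=cuda>=11.0",
	}

	vars := nvidiaEnv(containerEnv)

	// Sort keys so the output order is stable.
	keys := make([]string, 0, len(vars))
	for k := range vars {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, vars[k])
	}
}
```

Splitting on the first "=" only keeps values such as cuda>=11.0 intact.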
We currently read these NVIDIA_ environment variables from the host in the experimental --nvccli code. So, the host, rather than the container, decides how nvidia-container-cli will be instructed.
We should really switch to reading them from the container environment, and implement the --gpus flag (#360) plus a SINGULARITY_GPUS env var as host-side overrides / options for GPU configuration.
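One way the proposed precedence could look, as a rough Go sketch: a host-side override (standing in for --gpus / SINGULARITY_GPUS) wins, otherwise the container's NVIDIA_VISIBLE_DEVICES is used. Neither the override semantics nor the function `resolveVisibleDevices` exist in Singularity; both are assumptions for illustration, and the real design might differ.

```go
package main

import (
	"fmt"
	"os"
)

// resolveVisibleDevices picks the device selection to hand to GPU setup.
// hostOverride models the proposed --gpus flag / SINGULARITY_GPUS variable;
// containerEnv models variables read from the container environment.
// The precedence (host override first) is an assumption, not settled design.
func resolveVisibleDevices(hostOverride string, containerEnv map[string]string) (string, bool) {
	if hostOverride != "" {
		return hostOverride, true
	}
	if v := containerEnv["NVIDIA_VISIBLE_DEVICES"]; v != "" {
		return v, true
	}
	// Nothing requested on either side.
	return "", false
}

func main() {
	// Pretend this map was read from the container environment.
	containerEnv := map[string]string{"NVIDIA_VISIBLE_DEVICES": "all"}

	// SINGULARITY_GPUS is the proposed host-side variable; it may be unset.
	override := os.Getenv("SINGULARITY_GPUS")

	if devices, ok := resolveVisibleDevices(override, containerEnv); ok {
		fmt.Println("devices:", devices)
	} else {
		fmt.Println("no GPU setup requested")
	}
}
```

Keeping the host override first preserves the ability for an admin or user to restrict GPUs regardless of what the image asks for.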
This is rather involved, because our environment setup within the container is performed by a shell interpreter running the action script, which in turn sources other scripts in the container that influence the final environment. It is a long way from simply reading the env vars out of an OCI runtime config.
See:
https://github.com/sylabs/singularity/blob/fba2dfa94784b872abbc42377f4bc172c6911739/internal/pkg/runtime/engine/singularity/process_linux.go#L799
https://github.com/sylabs/singularity/blob/fba2dfa94784b872abbc42377f4bc172c6911739/internal/pkg/util/fs/files/action_scripts.go#L9
This feels a lot like something that we'd want to address in a 4.0, with thoughts about restructuring some of this, or capturing (some?) environment in a static config in the SIF. Otherwise we need to run a script in the container... mostly re-implementing the action script but without the exec of the container payload.
We can probably keep the current 'experimental' approach of using the NVIDIA_ vars from the host in 3.9, and just add the --gpus flag.
@tri-adam - would be good to have a chat about this at some point, especially thinking forward to 4.0.
Closing this. We are moving toward OCI mode and NVIDIA's CDI support.