
--gpus flag

Open dtrudg opened this issue 4 years ago • 4 comments

Describe the solution you'd like

The --gpus flag for the NVIDIA Docker runtime configures the nvidia-container-cli setup so that, e.g.,

--gpus "all,capabilities=utility"

is equivalent to NVIDIA_VISIBLE_DEVICES=all, NVIDIA_DRIVER_CAPABILITIES=utility.

It would be an advantage to be able to use --gpus rather than requiring the individual environment variables to be set. A matching SINGULARITY_GPUS env var would be appropriate.

Note that with #361 we would read the NVIDIA_ env vars from the container instead of the host, so --gpus / SINGULARITY_GPUS would be needed to override them.

Edit - as noted in the discussion below, because we aren't yet defaulting to --nvccli, it wouldn't be very friendly for --gpus not to apply to SingularityCE's own GPU setup. We need to handle device binding / masking in that case - but we could ignore the capabilities portion, and perhaps only support numeric GPU IDs, not MIG UUIDs etc.

dtrudg avatar Oct 11 '21 15:10 dtrudg
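For discussion, here is a minimal sketch, in Go, of how a --gpus / SINGULARITY_GPUS value such as "all,capabilities=utility" could be translated into the NVIDIA_VISIBLE_DEVICES / NVIDIA_DRIVER_CAPABILITIES variables that nvidia-container-cli reads. This is not SingularityCE code; the device= / capabilities= keywords and the defaults are assumptions modelled loosely on Docker's --gpus flag.

    // Sketch only: map a hypothetical --gpus / SINGULARITY_GPUS value onto the
    // NVIDIA_ env vars consumed by nvidia-container-cli. Keywords and defaults
    // are assumptions modelled on Docker's flag, not a confirmed spec.
    package main

    import (
        "fmt"
        "strings"
    )

    // parseGPUs returns the values to use for NVIDIA_VISIBLE_DEVICES and
    // NVIDIA_DRIVER_CAPABILITIES. Multi-valued capabilities (which Docker
    // handles with extra quoting) are not covered here.
    func parseGPUs(val string) (visible, caps string) {
        visible = "all"          // assumed default
        caps = "compute,utility" // assumed default
        var devices []string
        for _, part := range strings.Split(val, ",") {
            switch {
            case strings.HasPrefix(part, "capabilities="):
                caps = strings.TrimPrefix(part, "capabilities=")
            case strings.HasPrefix(part, "device="):
                devices = append(devices, strings.TrimPrefix(part, "device="))
            case part == "all":
                visible = "all"
            default:
                // Treat bare values as numeric GPU IDs, e.g. --gpus "0,1".
                devices = append(devices, part)
            }
        }
        if len(devices) > 0 {
            visible = strings.Join(devices, ",")
        }
        return visible, caps
    }

    func main() {
        v, c := parseGPUs("all,capabilities=utility")
        // Prints: NVIDIA_VISIBLE_DEVICES=all NVIDIA_DRIVER_CAPABILITIES=utility
        fmt.Printf("NVIDIA_VISIBLE_DEVICES=%s NVIDIA_DRIVER_CAPABILITIES=%s\n", v, c)
    }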

This is still worth pursuing prior to https://github.com/sylabs/singularity/issues/361#issuecomment-940474009

A --gpus flag / SINGULARITY_GPUS env var could override the host NVIDIA_xxx env vars for this purpose.

dtrudg avatar Oct 11 '21 22:10 dtrudg

@dtrudg this looks like low-hanging fruit, so maybe I can help! Is this still desired, and if so, could you give a quick summary of what the implementation should do? E.g.,

  1. add a --gpus flag to run/exec/shell
  2. given the presence of the flag, set those NVIDIA env vars?
  3. and the same behaviour should be triggered by SINGULARITY_GPUS?

And this

Note that with https://github.com/sylabs/singularity/issues/361 we would read the NVIDIA_ env vars from the container instead of the host, so --gpus / SINGULARITY_GPUS are required to override.

Should this be tackled after the first set of env vars is added, or at the same time? And if at the same time, could we chat about what that means? I'm not familiar with the current interaction with NVIDIA GPUs!

vsoch avatar May 05 '22 19:05 vsoch

Hi @vsoch - it is, unfortunately, not as easy as it first seems.

For the case where the experimental --nvccli flag is used, and nvidia-container-cli sets up the GPUs in the container environment, we can just set the correct NVIDIA_xxx vars. nvidia-container-cli will then do the right thing based on those.

The catch is that --nvccli is still not our default in 3.10... most people will be using --nv only, where Singularity code is responsible for the binding of GPU devices. There we would need to make sure our own code can interpret the value of a --gpus flag, and then mask or bind GPU devices from/into the container as appropriate. Currently we bind all devices, so there's a fair amount of logic that needs to go into this.

dtrudg avatar May 05 '22 20:05 dtrudg
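To make the --nv concern above concrete, here is a small Go sketch of the masking logic that would be needed: given a set of requested numeric GPU IDs, bind only the matching /dev/nvidiaN nodes while always binding the control devices. It is purely illustrative and not the actual --nv code path in SingularityCE.

    // Sketch only: decide which NVIDIA device nodes to bind into the container
    // when only some numeric GPU IDs are requested. The real --nv logic is
    // more involved (MIG, symlinks, libraries, etc.).
    package main

    import (
        "fmt"
        "regexp"
    )

    var perGPUDev = regexp.MustCompile(`^/dev/nvidia(\d+)$`)

    // filterDeviceBinds keeps control devices (nvidiactl, nvidia-uvm, ...) and
    // only the per-GPU nodes whose index is in the requested set.
    func filterDeviceBinds(allDevs []string, requested map[string]bool) []string {
        var binds []string
        for _, dev := range allDevs {
            m := perGPUDev.FindStringSubmatch(dev)
            if m == nil || requested[m[1]] {
                binds = append(binds, dev)
            }
        }
        return binds
    }

    func main() {
        devs := []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0", "/dev/nvidia1"}
        // Requesting GPU 0 only: nvidia1 is masked, control devices stay.
        fmt.Println(filterDeviceBinds(devs, map[string]bool{"0": true}))
    }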

I should say explicitly... if you'd like to take this on further... please reach out on Slack or similar and I can demonstrate some of the issues to you. I don't want to put you off completely here :-)

dtrudg avatar May 05 '22 21:05 dtrudg

@dtrudg from my perspective, I would like to argue against implementing a --gpus flag and would advocate for supporting CDI devices through the --device flag (or similar) as is done for podman.

elezar avatar Feb 20 '23 10:02 elezar

Agreed. Seems clear that this would be better implemented via --device with CDI in #1394 and potentially #1395 ... given that things are moving in that direction generally across runtimes.

dtrudg avatar Mar 01 '23 15:03 dtrudg
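For context on the CDI direction: a CDI spec is a JSON (or YAML) file, typically placed under /etc/cdi, that a CDI-aware runtime resolves when it sees a device name such as nvidia.com/gpu=0 on the --device flag. The following minimal, hand-written example is illustrative only; in practice a tool like NVIDIA's nvidia-ctk generates the real spec, and the exact device nodes and cdiVersion will differ.

    {
      "cdiVersion": "0.5.0",
      "kind": "nvidia.com/gpu",
      "devices": [
        {
          "name": "0",
          "containerEdits": {
            "deviceNodes": [
              { "path": "/dev/nvidia0" }
            ]
          }
        }
      ],
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidiactl" },
          { "path": "/dev/nvidia-uvm" }
        ]
      }
    }

A runtime supporting this could then accept something like --device nvidia.com/gpu=0, which is the Podman-style usage referred to above.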