Incorrect number of autodetected GPUs
This is not that urgent/important, as manually specifying the GPU count is easy. Currently HQ will auto-detect all GPUs on a node, instead of just the ones which we actually have access to.
One could use nvidia-smi to get the available GPUs, or check manually
with something like the following (mocked in Python, as I'm not that familiar with Rust):
import glob

# Count the GPUs known to the driver, then check which device files we can actually open.
num_total_gpus = len(glob.glob("/proc/driver/nvidia/gpus/*"))
for n in range(num_total_gpus):
    try:
        with open("/dev/nvidia" + str(n), "r"):
            print("Access to /dev/nvidia" + str(n))
    except IOError:
        print("No perm to /dev/nvidia" + str(n))
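Alternatively, a rough sketch of the nvidia-smi route mentioned above (assuming nvidia-smi is on PATH; under Slurm's device cgroups it should only report the GPUs the job was actually granted):

import subprocess

def visible_gpu_indices():
    # Ask nvidia-smi for the indices of the GPUs it can see.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]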
It's also possible that some additional changes are required to get GPU placement/IDs correct, and possibly to set up NUMA affinities, when running on a partial node (or is this something you are planning to support / is it already supported? The documentation was a bit scarce on this point).
Interesting. Is there some properly defined way of finding out which GPUs are actually enabled for the user? Trying to open /dev/nvidia devices seems a bit "fishy".
In general, we tried to make the detection as simple as possible (it's really a "best effort" attempt), mainly to avoid potential C dependencies. We could use https://docs.rs/nvml-wrapper/latest/nvml_wrapper/, but we would need to make sure that it is optional (I'm not sure how easy that is), as we are currently strongly opposed to depending on anything external other than (g)libc.
Could you please provide some more context? On which cluster does this happen, what is the Nvidia/CUDA configuration used there? Is there e.g. some environment variable that could be used to distinguish which GPUs are available?
We have roughly the same setup on three clusters:
- Puhti (Finnish national system, V100)
- Mahti (Finnish national system, A100)
- LUMI (currently testing on A40s)
All of them are Slurm clusters. Slurm has a setting
to constrain devices using device cgroups (this is what causes EPERM when opening /dev/nvidia<num>, see https://slurm.schedmd.com/cgroup.conf.html); other schedulers might have similar settings. If not configured to use cgroups, Slurm (and probably other schedulers) will just rely on CUDA_VISIBLE_DEVICES to direct users to the correct devices, although I think sites are moving away from this where possible, as the variable is user-writable.
So a less hacky option would perhaps be to check if CUDA_VISIBLE_DEVICES is set and just use that to determine which GPUs are available to the worker. I tried to see whether the device cgroups can be checked manually under /sys/fs/cgroup, similarly to how it can be done for CPUs and memory, but so far I have not figured out how to do it.
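A minimal sketch of that idea, assuming the worker simply trusts CUDA_VISIBLE_DEVICES when it is set and otherwise falls back to the procfs count (the function name is just for illustration):

import glob
import os

def detect_nvidia_gpus():
    # Prefer the scheduler-provided CUDA_VISIBLE_DEVICES when it is set...
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        return [item.strip() for item in visible.split(",") if item.strip()]
    # ...otherwise fall back to all GPUs known to the driver.
    return [str(i) for i in range(len(glob.glob("/proc/driver/nvidia/gpus/*")))]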
I'll run some more tests and report back.
Thanks for the details. Using CUDA_VISIBLE_DEVICES makes sense to us, since it is also what users are used to: they would expect that CUDA_VISIBLE_DEVICES=1,2 hq worker start uses the appropriate GPUs, but it currently doesn't.
I have opened a PR to add support for this environment variable.
Nice, thanks for the quick response. Related, but perhaps a new issue: one might also want to add HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES to handle AMD GPUs. I'm not 100% sure what Slurm or other schedulers set when allocating GPUs; I need to check.
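For illustration, one way to extend the detection to also consult the AMD variables, checked in an assumed priority order (the order and the function name are my assumptions, not something the schedulers define):

import os

# Assumed priority: ROCr's variable first, then HIP's, then CUDA's.
GPU_ENV_VARS = ["ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"]

def gpus_from_environment():
    for var in GPU_ENV_VARS:
        value = os.environ.get(var)
        if value:
            return var, [item.strip() for item in value.split(",") if item.strip()]
    return None, []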
I now have access to LUMI, so I'll take a look into adding support for the AMD variables.
I also noticed that on Karolina (the IT4I cluster), all GPUs can be accessed from procfs; the only separation is performed through CUDA_VISIBLE_DEVICES. That would be fine for HQ, but there is currently a slight problem: the cluster sets the environment variable using string IDs, while HQ only supports numeric indices at the moment. We will also add support for that.
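A rough sketch of accepting both forms, treating each comma-separated entry as an opaque ID and only recording whether it happens to be numeric (this is just an illustration, not HQ's actual internal representation):

import os

def parse_visible_devices(value):
    # Entries may be numeric indices ("0", "1") or string IDs such as GPU UUIDs ("GPU-<uuid>").
    devices = []
    for item in value.split(","):
        item = item.strip()
        if item:
            devices.append({"id": item, "numeric": item.isdigit()})
    return devices

# Example: both "0,1" and "GPU-aaaa,GPU-bbbb" parse into the same uniform structure.
devices = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES", "0,1"))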