mpich hydra: Add assignment option for GPU

We could do something similar to what SLURM does for CUDA: https://slurm.schedmd.com/gres.html#GPU_Management

We also need to investigate the assignment approach for AMD and Intel GPUs.

Jun 18 '21 20:06 yfguo

For reference, --gpus-per-proc was added in this commit: https://github.com/pmodels/mpich/pull/4862/commits/2aa2a6cdf8bbce92fa3a3023efdb175a1cf2f8bc

--gpus-per-proc sets the environment variable CUDA_VISIBLE_DEVICES. Reference: https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
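To make the idea concrete, here is a minimal sketch of how a launcher could derive a per-process CUDA_VISIBLE_DEVICES value. This is not Hydra's actual code: the function names, the contiguous block assignment, and the `local_rank` / `num_gpus` parameters are all assumptions for illustration.

```python
def visible_devices(local_rank, gpus_per_proc, num_gpus):
    """Return a CUDA_VISIBLE_DEVICES string for one process on a node.

    Assumption: GPUs are handed out in contiguous blocks by local rank.
    The real Hydra policy may differ.
    """
    if gpus_per_proc <= 0:
        raise ValueError("gpus_per_proc must be positive")
    first = local_rank * gpus_per_proc
    if first + gpus_per_proc > num_gpus:
        raise ValueError("not enough GPUs on this node for this rank")
    return ",".join(str(i) for i in range(first, first + gpus_per_proc))


def child_environment(base_env, local_rank, gpus_per_proc, num_gpus):
    """Copy the parent environment and add the GPU visibility variable."""
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(
        local_rank, gpus_per_proc, num_gpus
    )
    return env
```

With 4 GPUs and 2 GPUs per process, local rank 0 would see devices 0 and 1, and local rank 1 would see devices 2 and 3.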

Jun 22 '21 16:06 hzhou

-bind-to gpu1 is also supported. Reference: https://github.com/pmodels/mpich/blob/7f8eefd25fe603ddf0e3ef6fdcabfc829a6d8890/src/pm/hydra/tools/topo/hwloc/topo_hwloc.c#L268-L287

Jun 22 '21 16:06 hzhou

Are we looking for options such as mpiexec -bind-to {cuda1,cuda2,ze1,ze2} etc.? If hwloc supports it, then it is just a matter of adding the name/alias into topo_hwloc.c. @yfguo @abrooks98 @zhenggb72 , can you confirm?
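As a rough illustration of the name/alias idea, the sketch below splits a bind target such as gpu1 or ze2 into a device kind and an index. It is a model only: topo_hwloc.c is written in C and may parse these names differently, and the `cuda` and `ze` prefixes are the hypothetical aliases proposed in this comment, not options known to exist.

```python
import re

# "gpu" appears in this thread as supported; "cuda" and "ze" are the
# hypothetical aliases under discussion.
KNOWN_PREFIXES = ("gpu", "cuda", "ze")


def parse_bind_target(name):
    """Split a target like 'gpu1' into ('gpu', 1).

    The index is returned as written; whether it is 0- or 1-based is
    left to the caller, since that is not stated in this thread.
    """
    match = re.fullmatch(r"([a-z]+)(\d+)", name)
    if match is None or match.group(1) not in KNOWN_PREFIXES:
        raise ValueError(f"unrecognized GPU bind target: {name!r}")
    return match.group(1), int(match.group(2))
```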

Jun 26 '21 01:06 hzhou

Yes, we are looking for options similar to -bind-to socket, but in this case to set up the affinity masks in the ranks based on the mapping of ranks to sockets and the GPUs connected to each socket. I do not think that the flags need to specify anything about the GPU type itself, although underneath, we will need to discover the type of GPUs that we have.
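A minimal sketch of that mapping, assuming the topology has already been discovered (for example through hwloc) and is given as a list of sockets with their CPU and GPU IDs. The data layout, the round-robin placement, and the name `plan_affinity` are assumptions made for illustration, not an existing Hydra interface.

```python
def plan_affinity(num_ranks, sockets):
    """Assign each rank the CPUs and GPUs of one socket.

    sockets: list of dicts like {"cpus": [0, 1], "gpus": [0]}.
    Ranks are placed round-robin across sockets (an assumed policy).
    The GPU type is never consulted here; only locality matters.
    """
    if not sockets:
        raise ValueError("at least one socket is required")
    plan = []
    for rank in range(num_ranks):
        socket_id = rank % len(sockets)
        socket = sockets[socket_id]
        plan.append(
            {
                "rank": rank,
                "socket": socket_id,
                "cpus": list(socket["cpus"]),
                "gpus": list(socket["gpus"]),
            }
        )
    return plan
```

For a two-socket node where each socket has one GPU attached, ranks 0 and 2 would be bound to socket 0 and its GPU, and ranks 1 and 3 to socket 1 and its GPU.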

Jun 28 '21 03:06 garzaran

BTW, support for Level Zero should be in the master branch or HWLOC 2.5.

Jun 28 '21 22:06 garzaran

This issue is addressed by https://github.com/pmodels/mpich/pull/5870 (or at least it was supposed to be). If some of the details are still missing, either reopen this issue with specifics or open a new one.

Oct 12 '22 02:10 hzhou