
hydra: Add assignment option for GPU

Open yfguo opened this issue 4 years ago • 5 comments

We could do something similar to what SLURM does for CUDA GPUs: https://slurm.schedmd.com/gres.html#GPU_Management.

Also need to investigate the assignment approach for AMD and Intel GPUs.

yfguo avatar Jun 18 '21 20:06 yfguo

For reference, --gpus-per-proc was added in this commit: https://github.com/pmodels/mpich/pull/4862/commits/2aa2a6cdf8bbce92fa3a3023efdb175a1cf2f8bc

--gpus-per-proc sets the environment variable CUDA_VISIBLE_DEVICES. Reference: https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
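
To check what a launched process actually received, a small script along these lines could be run under mpiexec. This is only a sketch: the helper name visible_devices is made up for illustration, and it assumes the usual comma-separated format of CUDA_VISIBLE_DEVICES (device indices or UUIDs). It does not reflect how Hydra computes the value.

```python
import os


def visible_devices(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings.

    Hypothetical helper for illustration only.
    Returns None when the variable is unset (no restriction was applied)
    and an empty list when it is set but empty.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are comma-separated; drop surrounding whitespace and empty tokens.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    # Print what this process sees; run one copy per rank to compare.
    print(visible_devices())
```

Running one copy per rank and comparing the printed lists should show whether each process was given a distinct set of devices.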

hzhou avatar Jun 22 '21 16:06 hzhou

-bind-to gpu1 is also supported. Reference: https://github.com/pmodels/mpich/blob/7f8eefd25fe603ddf0e3ef6fdcabfc829a6d8890/src/pm/hydra/tools/topo/hwloc/topo_hwloc.c#L268-L287

hzhou avatar Jun 22 '21 16:06 hzhou

Are we looking for options such as mpiexec -bind-to {cuda1,cuda2,ze1,ze2} etc.? If hwloc supports it, then it is just a matter of adding the name/alias to topo_hwloc.c. @yfguo @abrooks98 @zhenggb72, can you confirm?

hzhou avatar Jun 26 '21 01:06 hzhou

Yes, we are looking for options similar to -bind-to socket, but in this case to set up the affinity masks in the ranks based on the mapping of ranks to sockets and the GPUs connected to each socket. I do not think that the flags need to specify anything about the GPU type itself, although underneath we will need to discover the type of GPUs that we have.
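
As a rough illustration of the mapping described here (not Hydra's actual logic), the sketch below takes two hypothetical inputs: rank_to_socket, giving the socket each rank is bound to, and socket_to_gpus, giving the GPU indices attached to each socket. It deals each socket's GPUs out round-robin to the ranks on that socket. The function name and both dictionaries are assumptions; in practice the topology would have to be discovered, for example through hwloc.

```python
def gpus_for_ranks(rank_to_socket, socket_to_gpus):
    """Assign each rank GPUs attached to the socket it is bound to.

    Hypothetical sketch: both arguments are plain dicts standing in for
    topology information that a real launcher would have to discover.
    """
    # Group ranks by socket, in ascending rank order.
    ranks_on_socket = {}
    for rank in sorted(rank_to_socket):
        ranks_on_socket.setdefault(rank_to_socket[rank], []).append(rank)

    assignment = {}
    for socket, ranks in ranks_on_socket.items():
        gpus = socket_to_gpus.get(socket, [])
        for slot, rank in enumerate(ranks):
            # Each rank takes every len(ranks)-th GPU starting at its own slot,
            # so a rank gets an empty list when ranks outnumber the GPUs.
            assignment[rank] = gpus[slot::len(ranks)]
    return assignment


if __name__ == "__main__":
    # Two sockets with two GPUs each, two ranks per socket.
    print(gpus_for_ranks({0: 0, 1: 0, 2: 1, 3: 1}, {0: [0, 1], 1: [2, 3]}))
```

Nothing in this sketch depends on the GPU vendor, which matches the point that the flag itself need not name the GPU type.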

garzaran avatar Jun 28 '21 03:06 garzaran

BTW, support for Level Zero should be in the hwloc master branch or in hwloc 2.5.

garzaran avatar Jun 28 '21 22:06 garzaran

This issue is addressed by https://github.com/pmodels/mpich/pull/5870 (or at least it was supposed to be). If some details are still missing, either reopen this issue with specifics or open a new one.

hzhou avatar Oct 12 '22 02:10 hzhou