
Mellanox IB/OFED Library Discovery & Binding

Open dtrudg opened this issue 3 years ago • 7 comments

Is your feature request related to a problem? Please describe.
When running a multi-node application that uses Infiniband networking, the user is currently responsible for making sure that the required libraries are present in the container, or bound in from the host. This requires knowledge of libraries below the application level, knowledge that is not commonly needed outside of a container, as an HPC admin will have made the libraries available by default.

Describe the solution you'd like
We should be able to discover the required libraries on the host, for automatic bind-in when the container distribution is compatible.
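
As a very rough illustration of the kind of discovery meant here, something along the lines of the Python sketch below could scan the host's ld.so cache for common IB/OFED libraries and emit a bind list. The library name prefixes and the reliance on ldconfig are illustrative assumptions only, not a proposed implementation.

#!/usr/bin/env python3
"""Sketch: find common IB/OFED libraries on the host and print a bind list."""
import subprocess

# Illustrative, not exhaustive: library name prefixes typically needed for Infiniband.
IB_LIB_PREFIXES = ("libibverbs", "libmlx4", "libmlx5", "librdmacm", "libnl-3", "libnl-route-3")

def discover_ib_libs():
    """Return host paths of IB/OFED libraries found in the ld.so cache."""
    out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True, check=True).stdout
    paths = set()
    for line in out.splitlines()[1:]:  # first line is a summary header
        # Entries look like: "libibverbs.so.1 (libc6,x86-64) => /usr/lib64/libibverbs.so.1"
        name_part, _, path = line.partition(" => ")
        if not path:
            continue
        if name_part.strip().split(" ")[0].startswith(IB_LIB_PREFIXES):
            paths.add(path.strip())
    return sorted(paths)

if __name__ == "__main__":
    # Comma-separated output, suitable for --bind or SINGULARITY_BINDPATH.
    print(",".join(discover_ib_libs()))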

Describe alternatives you've considered
Documentation improvements could assist, but library paths vary between systems, so it is difficult to give a simple example that will almost always work.

dtrudg avatar Jun 03 '21 20:06 dtrudg

On this topic, just posting links to potentially useful tools:

  • e4s-cl, to discover host MPI libraries, https://github.com/E4S-Project/e4s-cl
  • wi4mpi, to translate between MPI implementations, https://github.com/cea-hpc/wi4mpi

marcodelapierre avatar Feb 24 '22 03:02 marcodelapierre

Thanks for the links. I've come across e4s-cl before but not wi4mpi. Both are great projects; however, the latter is a bit out of scope for what we're thinking about here.

The big challenge that we have, from user feedback over the years, is that a majority of the users we've interacted with on this general issue don't want to have to know how to configure and maintain bindings, or profiles... and in many places system administrators don't want to set them up globally, and maintain them. The expectation is there should be a direct flag for singularity --mpi / --infiniband etc. and just adding that flag should make things work.
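
To make that expectation concrete, here is a deliberately naive sketch of the sort of wrapper behaviour users have in mind: a hypothetical --infiniband flag (not a real singularity option) that quietly turns into the bind plumbing they currently do by hand. The hard-coded paths are just placeholders.

#!/usr/bin/env python3
"""Toy sketch of the 'just add a flag' expectation; not real singularity behaviour."""
import os
import subprocess
import sys

args = sys.argv[1:]
if "--infiniband" in args:  # hypothetical flag, shown for illustration only
    args.remove("--infiniband")
    # A real implementation would discover these on the host rather than hard-code them.
    ib_binds = "/etc/libibverbs.d,/usr/lib64/libibverbs.so.1,/usr/lib64/libmlx5.so.1"
    existing = os.environ.get("SINGULARITY_BINDPATH")
    os.environ["SINGULARITY_BINDPATH"] = ",".join(filter(None, [existing, ib_binds]))

# Hand everything else straight through to the real singularity binary.
sys.exit(subprocess.run(["singularity"] + args).returncode)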

This viewpoint tends to make it feel to some people that wrappers such as e4s-cl are moving the problem rather than solving it. I'm somewhat sympathetic to this: e.g. e4s-cl profile detect is great, but it depends on running something that meets various discovery conditions, which users are often unsure how to satisfy.

Of course, broadly applicable auto-magic handling of MPI and OFED etc. is extremely difficult. I'm certainly not sure what the best answer is. I think e4s-cl might be something we could recommend and perhaps document as a helper. We'll certainly circle back to take a look at this, and also our user docs repository is open for contributions - https://github.com/sylabs/singularity-userdocs/

dtrudg avatar Feb 28 '22 15:02 dtrudg

Noting that I put out an appeal on the mailing list, in the hope of getting some more info on how people are handling this currently: https://groups.google.com/g/singularity-ce/c/iBcfIb5A9ck

dtrudg avatar Feb 28 '22 15:02 dtrudg

@dtrudg for basic automation, or for providing wrappers to manage interactions with containers, we have a project, Singularity HPC, that makes this relatively easy -> https://singularity-hpc.readthedocs.io/en/latest/. You can add binds to specific commands for specific containers in the container.yaml, or set them in the user or global config for a more global setting. If there is anything else we can do over there to help with this particular issue (even just adding more bases for MPI) please let me know! @marcodelapierre on this thread is a big contributor to the project too.

vsoch avatar Feb 28 '22 19:02 vsoch

Hi @dtrudg , @vsoch , thanks for the additional comments, and sorry it has taken me so long to get back here!

Dave, I agree, it is probably difficult to set up a one-wand-discovers-all solution for MPI/interconnect configurations.

To be honest, I mentioned e4s-cl because I have heard good reviews, but I have not yet had the chance to trial it myself, so I was not sure what its discovery limits are, e.g. relative to e4s-cl profile detect.

A couple of extra comments I wanted to add.

As regards who is in the better position to set up MPI for Singularity on shared clusters, I am surprised to hear certain administrators are not keen on maintaining it, because they're the ones that know the cluster best, and they are likely to have more know-how than the users to get it done. I am saying this because I would personally vote for any extra feature/documentation in Singularity in this space to primarily target system administrators rather than end users. Just my own view of course.

The other comment, just to further document the issue, is a posting of two sets of configurations I devised for two of our clusters here at the Pawsey Centre, to give a couple of examples of what these setups look like.

A Cray system, with Cray-MPICH (lazily in Tcl, as it comes from a software module):

setenv SINGULARITY_BINDPATH "/var/opt/cray/alps,/etc/opt/cray/wlm_detect,/opt/cray,/etc/alternatives/cray-alps,/etc/alternatives/cray-udreg,/etc/alternatives/cray-ugni,/etc/alternatives/cray-wlm_detect,/etc/alternatives/cray-xpmem,/usr/lib64/libgfortran.so.3,/usr/lib64/libxmlrpc-epi.so.0,/usr/lib64/libodbc.so.2,/usr/lib64/libltdl.so.7"

setenv SINGULARITYENV_LD_LIBRARY_PATH "/opt/cray/pe/mpt/default/gni/mpich-gnu-abi/4.9/lib:/opt/cray/xpmem/default/lib64:/opt/cray/ugni/default/lib64:/opt/cray/udreg/default/lib64:/opt/cray/pe/pmi/default/lib64:/opt/cray/alps/default/lib64:/opt/cray/wlm_detect/default/lib64:/opt/cray/pe/mpt/default/gni/mpich-gnu/4.9/lib:/usr/lib64:\$LD_LIBRARY_PATH"

A non-Cray system, with OpenMPI (this time the module is in Lua):

setenv("SINGULARITY_BINDPATH","/etc/dat.conf,/etc/libibverbs.d,/usr/lib64/libdaplofa.so.2,/usr/lib64/libdaplofa.so.2.0.0,/usr/lib64/libdat2.so.2,/usr/lib64/libdat2.so.2.0.0,/usr/lib64/libibverbs.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libibverbs.so.1.0.0,/usr/lib64/libmlx5.so.1,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-3.so.200.23.0,/usr/lib64/libnl-cli-3.so.200,/usr/lib64/libnl-cli-3.so.200.23.0,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-genl-3.so.200.23.0,/usr/lib64/libnl-idiag-3.so.200,/usr/lib64/libnl-idiag-3.so.200.23.0,/usr/lib64/libnl-nf-3.so.200,/usr/lib64/libnl-nf-3.so.200.23.0,/usr/lib64/libnl-route-3.so.200,/usr/lib64/libnl-route-3.so.200.23.0,/usr/lib64/librdmacm.so.1,/usr/lib64/librdmacm.so.1.0.0,/usr/lib64/libpciaccess.so.0,/usr/lib64/libpmi.so.0,/usr/lib64/libpmi2.so.0,/usr/lib64/libnuma.so.1,/usr/lib64/slurm/libslurmfull.so,/usr/lib64/libmlx4-rdmav2.so,/usr/lib64/librxe-rdmav2.so,/usr/lib64/libmlx5-rdmav2.so,/usr/lib64/libpsm2.so.2,/usr/lib64/libfabric.so.1,/usr/lib64/libpsm_infinipath.so.1,/usr/lib64/libinfinipath.so.4")

setenv("SINGULARITYENV_LD_LIBRARY_PATH","/usr/lib64:/pawsey/centos7.6/devel/cascadelake/gcc/8.3.0/openmpi-ucx/4.0.2/lib:/pawsey/centos7.6/devel/gcc/4.8.5/ucx/1.6.0/lib:/usr/lib64:/pawsey/intel/19.0.5/compilers_and_libraries/linux/mpi/intel64/libfabric/lib:/pawsey/intel/19.0.5/compilers_and_libraries/linux/mpi/intel64/lib/release:/pawsey/intel/19.0.5/compilers_and_libraries/linux/mpi/intel64/lib:$LD_LIBRARY_PATH")

Unfortunately I don't have the energy to contribute more to this topic right now, but happy to contribute to the conversation when I can.

marcodelapierre avatar Jun 22 '22 13:06 marcodelapierre

Thanks for the note @marcodelapierre

How you are doing it, with module env vars, is how we envisage people setting things up at present. But it does seem relatively uncommon for people to put the BIND/LD_LIBRARY_PATH settings for Singularity into their MPI modules.

I am surprised to hear certain administrators are not keen on maintaining it, because they're the ones that know the cluster best, and they are likely to have more know-how than the users to get it done

I think a lot of the issue is that there are multiple MPI stacks, and inevitably people need to use older versions at some point, which tend not to have great compatibility between versions. With containers, people expect things they obtain to just work, and the MPI variant the container uses to be supported. Unfortunately that can quickly devolve into needing 10+ MPI modules to cover the different implementations, plus the old versions with poorer compatibility. That can be quite a lot of work for a sysadmin team who are already quite stretched - and it often ends up with having to help people rebuild containers to work with a limited, reasonable set of MPI modules.

The situation is getting better, with newer versions of MPI stacks that have better compatibility guarantees, and with more containerized application software compiled to use those newer versions.

Unfortunately I don't have the energy to contribute more to this topic right now, but happy to contribute to the conversation when I can.

Really appreciate the insight you've been able to give. We'll continue discussion here, so just dive in whenever, if you have the opportunity and interest. Cheers!

dtrudg avatar Jun 22 '22 14:06 dtrudg

Thanks for the additional insights Dave, I find them very interesting!

But it does seem relatively uncommon for people to put the BIND/LD_LIBRARY_PATH settings for Singularity into their MPI modules.

Indeed, we do add those variables, but in a Singularity module, not the MPI one.

As regards multiple MPIs to maintain, I get the pain too; the maximum we have had to support in a cluster was 3:

  • MPICH
  • OpenMPI
  • CUDA-aware OpenMPI

Not sure whether this is a useful comment, but we try to keep the number of MPIs to support in a cluster down by:

  • Leveraging ABI-compatibility across MPICH, Cray, Intel MPI, & co as much as possible (a quick way to check which MPI the host actually provides is sketched after this list)
  • Declaring Singularity support for only one version of MPICH, and one of OpenMPI
  • Providing MPI base images for researchers, using those supported versions (to encourage them to stick with those).
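
Just to make the ABI-compatibility point a bit more concrete, a quick way to see which MPI implementation and version the host actually provides, before deciding what to bind in, is to query MPI_Get_library_version from the host library directly. This is only a sketch, not anything we ship; the library path below is an assumption and will differ between systems.

#!/usr/bin/env python3
"""Sketch: ask the host MPI library what it is, via MPI_Get_library_version."""
import ctypes

# Assumed path to the host MPI library (MPICH ABI soname); adjust per system.
libmpi = ctypes.CDLL("/usr/lib64/libmpi.so.12")

# MPI_Get_library_version(char *version, int *resultlen) may be called before MPI_Init.
version = ctypes.create_string_buffer(8192)  # comfortably above MPI_MAX_LIBRARY_VERSION_STRING
length = ctypes.c_int()
libmpi.MPI_Get_library_version(version, ctypes.byref(length))
print(version.value.decode())  # e.g. an "MPICH Version: ..." or Open MPI banner string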

Thanks for the chat for now, maybe we'll be in touch again at some point :) Marco

marcodelapierre avatar Jun 22 '22 14:06 marcodelapierre