
cray-mpich toolchain not recognized by MPIPreferences

williamfgc opened this issue 3 years ago · 6 comments

@luraess and I have been testing MPI.jl on OLCF's Crusher, which uses cray-mpich. Runs execute as independent MPI processes; we think the issue is that MPIPreferences can't recognize the cray-mpich system MPI. To reproduce:

module load rocm cray-mpich

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary()
ERROR: MPI library could not be found
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] use_system_binary(; library_names::Vector{String}, mpiexec::String, abi::Nothing, export_prefs::Bool, force::Bool)
   @ MPIPreferences ~/.julia/packages/MPIPreferences/MXVwb/src/MPIPreferences.jl:130
 [3] use_system_binary()
   @ MPIPreferences ~/.julia/packages/MPIPreferences/MXVwb/src/MPIPreferences.jl:117
 [4] top-level scope
   @ REPL[2]:1

cray-mpich doesn't provide a libmpi.so; it ships libmpi_cray.so instead and uses srun as the launcher:

    libmpi_cray.so.12 (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_cray.so.12
    libmpi_cray.so (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_cray.so
    libmpi_amd.so.12 (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_amd.so.12
    libmpi_amd.so (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_amd.so
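
For reference, a quick check of which names the dynamic loader can resolve (just an illustrative sketch using the standard Libdl API, not how MPIPreferences does its lookup):

julia> using Libdl

julia> Libdl.find_library(["libmpi"])        # no plain libmpi here, so this returns ""
""

julia> Libdl.find_library(["libmpi_cray"])   # the Cray-specific name resolves
"libmpi_cray"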

Just looking for guidance or potential feature support. We'd be happy to help make this functional and test it on Crusher.
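
To make the symptom concrete, a minimal rank check (a sketch; rank_check.jl is just a hypothetical file name) prints distinct ranks with a working system MPI, but rank 0 of 1 from every process when the wrong library is picked up:

# rank_check.jl -- launch with e.g. `srun -n 8 julia rank_check.jl`
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
# Working setup: each process prints a distinct rank out of the full size.
# Mismatched library: every process reports rank 0 of 1.
println("rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
MPI.Finalize()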

williamfgc avatar Jun 20 '22 14:06 williamfgc

You can use

MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun")

See the documentation for more details: https://juliaparallel.org/MPI.jl/dev/reference/mpipreferences/#MPIPreferences.use_system_binary
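
After restarting Julia you can double-check that the settings took effect; a small sketch, assuming MPI.jl 0.20+ for MPI.versioninfo():

julia> using MPIPreferences

julia> MPIPreferences.binary    # expected to report "system"

julia> MPIPreferences.abi       # "MPICH" is the expected ABI for cray-mpich

julia> using MPI

julia> MPI.versioninfo()        # prints the library path and preferences in use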

giordano avatar Jun 20 '22 15:06 giordano

Thanks for pointing to the refs @giordano!

luraess avatar Jun 20 '22 15:06 luraess

One thing we can probably do is add libmpi_cray.so to the default list of names; we already have some other vendor-specific names there. Maybe also improve the error message to point out these options. I'm not sure what to do about mpiexec, though: at the moment it takes only a single string, not a list of options like library_names. That seems deliberate, and the docstring does mention that you may want to change it to srun.
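
For concreteness, a rough sketch of the first idea (hypothetical constant and names, not the actual MPIPreferences source):

# Hypothetical sketch: extend the default names searched by use_system_binary()
# so a plain `MPIPreferences.use_system_binary()` also finds Cray MPICH.
const DEFAULT_LIBRARY_NAMES = [
    "libmpi",        # generic MPICH / Open MPI name
    "libmpi_ibm",    # IBM Spectrum MPI
    "msmpi",         # Microsoft MPI
    "libmpich",
    "libmpi_cray",   # Cray MPICH (this issue)
]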

giordano avatar Jun 20 '22 15:06 giordano

add libmpi_cray.so to the default list of names

That'd be great.

improve the error message to point out these options

This as well. I don't have enough insight into the rest to judge.

luraess avatar Jun 20 '22 18:06 luraess

You can use

MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun")

@giordano thanks, this made hello world work and recognize the 8 GPUs on one node. There are errors in the miniapp complaining about requesting more devices, and a segfault in a different test. We are currently debugging, but this already helped us move forward. We'll keep you posted.
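
In case it is useful for the device-count errors: a common pattern is to derive a node-local rank and use it as the GPU index. A minimal sketch using only MPI.jl calls, with the actual device selection left to the GPU package:

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
# Ranks on the same node share a node-local communicator; the local rank can
# then index into that node's GPUs (8 GCDs per Crusher node).
local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
local_rank = MPI.Comm_rank(local_comm)
# e.g. select device number `local_rank` with AMDGPU.jl here (the exact call
# depends on the AMDGPU.jl version, so it is omitted from this sketch).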

williamfgc avatar Jun 21 '22 09:06 williamfgc

I am testing Julia MPI on CSC LUMI using the #master branch of MPI, and still had to execute MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun"), as https://github.com/JuliaParallel/MPI.jl/pull/614 seems not to work yet. Any idea what could still make it not discoverable, @giordano?

luraess avatar Aug 03 '22 12:08 luraess

I am testing Julia MPI on CSC LUMI using the #master branch of MPI, and still had to execute MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun"), as https://github.com/JuliaParallel/MPI.jl/pull/614 seems not to work yet. Any idea what could still make it not discoverable, @giordano?

I just ran into this myself on a Cray system. Having to specify mpiexec="srun" is expected, see https://github.com/JuliaParallel/MPI.jl/pull/614#issuecomment-1161917327, but I don't quite understand why I also had to pass library_names=["libmpi_cray"]. I'll see if I can find some time to debug this.

giordano avatar Sep 09 '22 09:09 giordano

I just realised what my problem was: #614 isn't in any released version of MPIPreferences yet! With ]add MPIPreferences#master instead I get

julia> MPIPreferences.use_system_binary(; mpiexec="srun")
┌ Info: MPI implementation
│   libmpi = "libmpi_cray"
│   version_string = "MPI VERSION    : CRAY MPICH version 8.1.4.31 (ANL base 3.4a2)\nMPI BUILD INFO : Thu Mar 18 17:07 2021 (git hash 3e74f0c)\n"
│   impl = "CrayMPICH"
│   version = v"8.1.4"
└   abi = "MPICH"
┌ Warning: The underlying MPI implementation has changed. You will need to restart Julia for this change to take effect
│   binary = "system"
│   libmpi = "libmpi_cray"
│   abi = "MPICH"
│   mpiexec = "srun"
└ @ MPIPreferences /work/ta083/ta083/mose/.julia/packages/MPIPreferences/7NYMT/src/MPIPreferences.jl:151

MPI#master isn't sufficient because that's a different package.
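
For reference, the Pkg commands behind that distinction (a sketch; the keyword form is equivalent to ]add MPIPreferences#master):

using Pkg
Pkg.add(name="MPIPreferences", rev="master")   # the package that gained #614
# Pkg.add(name="MPI", rev="master") would track MPI.jl itself, a different
# package, and would not pull in an unreleased MPIPreferences.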

@simonbyrne any objections against tagging version 0.1.4?

giordano avatar Sep 16 '22 14:09 giordano

I'm going to close this ticket, as I believe the original issue was fixed by #614 (but remember you may need to change the mpiexec value if you want to use srun). For any other problems with Cray MPICH, please do open a new issue!

giordano avatar Sep 19 '22 14:09 giordano