cray-mpich toolchain not recognized by MPIPreferences
@luraess and I have been testing MPI.jl on OLCF's Crusher, which uses cray-mpich. Runs execute as independent MPI processes; we think the issue is that MPIPreferences can't recognize the cray-mpich system MPI. To reproduce:
module load rocm cray-mpich
julia> using MPIPreferences
julia> MPIPreferences.use_system_binary()
ERROR: MPI library could not be found
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] use_system_binary(; library_names::Vector{String}, mpiexec::String, abi::Nothing, export_prefs::Bool, force::Bool)
@ MPIPreferences ~/.julia/packages/MPIPreferences/MXVwb/src/MPIPreferences.jl:130
[3] use_system_binary()
@ MPIPreferences ~/.julia/packages/MPIPreferences/MXVwb/src/MPIPreferences.jl:117
[4] top-level scope
@ REPL[2]:1
cray-mpich doesn't provide a libmpi.so but rather libmpi_cray.so, and uses srun as the launcher executable:
libmpi_cray.so.12 (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_cray.so.12
libmpi_cray.so (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_cray.so
libmpi_amd.so.12 (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_amd.so.12
libmpi_amd.so (libc6,x86-64) => /opt/cray/pe/lib64/libmpi_amd.so
Just looking for guidance or potential feature support. We'll be happy to help make this functional and test it on Crusher.
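A quick way to sanity-check that the Cray library is loadable from Julia at all, independently of MPIPreferences (a sketch, assuming the cray-mpich module puts /opt/cray/pe/lib64 on the loader path):
# check that libmpi_cray can be dlopen'd from Julia
using Libdl
handle = Libdl.dlopen("libmpi_cray", Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
println("loaded: ", Libdl.dlpath(handle))
Libdl.dlclose(handle)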
You can use
MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun")
See the documentation for more details: https://juliaparallel.org/MPI.jl/dev/reference/mpipreferences/#MPIPreferences.use_system_binary
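The full flow is roughly the following (a sketch; MPIPreferences.binary and MPIPreferences.abi are the constants MPIPreferences currently exposes for checking the result, but the exact values depend on your versions):
using MPIPreferences
MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun")
# restart Julia so the updated LocalPreferences.toml takes effect, then verify:
using MPIPreferences
MPIPreferences.binary   # should now be "system"
MPIPreferences.abi      # "MPICH" for Cray MPICH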
Thanks for pointing to the refs @giordano!
One thing we can probably do is to add libmpi_cray.so to the default list of names; we already have some other vendor-specific names there. Maybe also improve the error message to point out these options. I'm less sure what to do about mpiexec: at the moment it takes only a single string, not a list of options like library_names. That seems deliberate, and the docstring does mention that you may want to change it to srun.
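Concretely, something along these lines (a purely hypothetical sketch, not the actual MPIPreferences source; the real default list and where it lives may differ):
# hypothetical sketch: default candidate names tried by use_system_binary(),
# with the Cray spelling appended; entries other than "libmpi" are examples
# of vendor-specific names
const default_library_names = [
    "libmpi",        # generic MPICH / Open MPI builds
    "libmpi_ibm",    # IBM Spectrum MPI
    "msmpi",         # Microsoft MPI
    "libmpich",
    "libmpi_cray",   # Cray MPICH (the proposed addition)
]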
add libmpi_cray.so to the default list of names
That'd be great!
improve the error message to point out these options
This as well. I don't have enough insight into the rest to judge.
You can use
MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun")
@giordano thanks, this made hello world work and recognize the 8 GPUs on one node. There are still errors in the miniapp, which complains about requesting more devices, and a segfault in a different test. We are currently debugging, but this already helped us move forward. We'll keep you posted.
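For reference, the hello-world test was along these lines (a minimal sketch, not our exact script; launched with something like srun -n 8 on a single node):
# minimal MPI hello-world sketch; run with e.g. `srun -n 8 julia hello.jl`
# (GPU binding flags for Crusher are site-specific and omitted here)
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
println("Hello from rank $rank of $nranks")
MPI.Barrier(comm)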
I am testing Julia MPI on CSC LUMI using the #master branch of MPI, and I still had to execute MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], mpiexec="srun"), as https://github.com/JuliaParallel/MPI.jl/pull/614 seems not to work yet. Any idea what could still make it not discoverable, @giordano?
I just ran into this myself on a Cray system. Having to specify mpiexec="srun" is expected, see https://github.com/JuliaParallel/MPI.jl/pull/614#issuecomment-1161917327, but I don't quite understand why I also had to pass library_names=["libmpi_cray"]. I'll see if I can find some time to debug this.
I just realised what my problem was: #614 isn't in any released version of MPIPreferences yet! With ]add MPIPreferences#master instead I get
julia> MPIPreferences.use_system_binary(; mpiexec="srun")
┌ Info: MPI implementation
│ libmpi = "libmpi_cray"
│ version_string = "MPI VERSION : CRAY MPICH version 8.1.4.31 (ANL base 3.4a2)\nMPI BUILD INFO : Thu Mar 18 17:07 2021 (git hash 3e74f0c)\n"
│ impl = "CrayMPICH"
│ version = v"8.1.4"
└ abi = "MPICH"
┌ Warning: The underlying MPI implementation has changed. You will need to restart Julia for this change to take effect
│ binary = "system"
│ libmpi = "libmpi_cray"
│ abi = "MPICH"
│ mpiexec = "srun"
└ @ MPIPreferences /work/ta083/ta083/mose/.julia/packages/MPIPreferences/7NYMT/src/MPIPreferences.jl:151
MPI#master isn't sufficient because that's a different package.
@simonbyrne any objections against tagging version 0.1.4?
I'm going to close this ticket as I believe the original issue was fixed by #614 (but remember that you may need to change the mpiexec value if you want to use srun). For any other problems with Cray MPICH, please do open a new issue!