MPI test hang on `mpiexecjl` with SLURM
Testing the MPI v0.20-dev (master) branch on a Linux server via SLURM, allocating resources with salloc and launching Julia as `julia --project -e 'using Pkg; Pkg.test()'` via `srun -n1 ./launch_julia.sh`, hangs at https://github.com/JuliaParallel/MPI.jl/blob/9a1dd861213a312fb8c16c28147d90664995ae6f/test/runtests.jl#L14-L18
even though the number of processes is kept consistent everywhere (`salloc -n` >= 1, `srun -n1`, `export JULIA_MPI_TEST_NPROCS=1`) and `ENV["SLURM_NTASKS"]` returns the expected number. Fix #559 does not yet solve the issue.
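(`launch_julia.sh` itself is not shown; below is a minimal sketch of what such a wrapper might contain, assuming it only loads the MPI module and starts the test run -- the module name and environment setup are placeholders, not the actual file:)

```bash
#!/bin/bash
# launch_julia.sh -- hypothetical reconstruction, not the actual script:
# the module name and environment setup are assumptions about the local system.
module load openmpi
export JULIA_MPI_TEST_NPROCS=1   # keep consistent with salloc/srun -n
julia --project -e 'using Pkg; Pkg.test()'
```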
You are using `srun -n1` to launch Julia. What happens when you do `salloc -n5` and set `export JULIA_MPI_TEST_NPROCS=4`?
Do I understand correctly that your mpiexec defaults to using srun behind the scenes?
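(One way to check this, assuming an Open MPI build, is to see what `mpiexec` resolves to and whether its SLURM launcher component is present; these are standard commands, nothing specific to this setup:)

```bash
which mpiexec
mpiexec --version
# for an Open MPI build: list the process-launch (plm) components;
# a "slurm" entry means mpiexec can start processes via srun internally
ompi_info | grep -i "plm"
```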
In general the MPI tests have not been verified to run within a SLURM task; that would require someone adding that setup to CI.
> You are using `srun -n1` to launch Julia
Yes, and also `export JULIA_MPI_TEST_NPROCS=1` to be consistent.
I also tested with `salloc -n1` and `salloc -n4` while keeping `srun -n1`. Using `salloc -n4`, `export JULIA_MPI_TEST_NPROCS=4`, and `srun -n4` also leads to the same behaviour.
What happens if you only use salloc, but not srun? My concern is that you are starting a SLURM task and then running MPI within that task; nested srun calls are not that well supported, IIRC.
From https://hpc-wiki.info/hpc/FAQ_Batch_Jobs:

> I get "srun: Job step creation temporarily disabled", have no results and my job seems to have idled until it times out?
>
> This usually is caused by "nested calls" to either srun or mpirun within the same job. The second or "inner" instance of srun/mpirun tries to allocate the same resources as the "outer" one already did, and thus cannot complete.
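Concretely, the nesting in this setup would look like the following sketch (the task counts are illustrative):

```bash
salloc -n4                   # outer: reserves 4 tasks
srun -n1 ./launch_julia.sh   # inner: starts a job step on that allocation
# inside Julia, Pkg.test() then runs `mpiexec -n 4 julia test_*.jl`;
# if mpiexec maps to srun, that is a second, nested job step which waits
# for resources the first step still occupies -- an apparent hang
```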
I see your point. Running without srun, however, leads to the same behaviour: the test suite hangs with the following message

```
srun: Job 118705 step creation temporarily disabled, retrying (Requested nodes are busy)
```
The test hangs in `mpiexecjl`.
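While it hangs, the job's steps can be inspected from a second shell to confirm the nested-step diagnosis; these are standard SLURM commands (the job id is taken from the message above):

```bash
squeue -s                  # --steps: list running job steps; a second step
                           # stuck pending next to the running one is the
                           # nested-srun symptom described above
scontrol show job 118705   # inspect the allocation itself
```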
The tests now run successfully (with the exception of the failing test_threads.jl -- see below) after commit https://github.com/JuliaParallel/MPI.jl/pull/564/commits/41121ad3eb4c2f986246d6872154586f142c7022 (in #564), upon copying the LocalPreferences.toml file into test/ and launching the tests as follows:
```
salloc -n2 [...]
srun --pty /bin/bash
./runme.sh
```
where `runme.sh` contains:
```bash
module load openmpi
export SLURM_MPI_TYPE=pmix
export UCX_WARN_UNUSED_ENV_VARS=n
export JULIA_MPI_TEST_NPROCS=2
julia --project -e 'using Pkg; Pkg.test()'
```
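As a sanity check for the `SLURM_MPI_TYPE=pmix` line, SLURM itself can report which MPI plugin types it supports (a standard `srun` option, independent of this setup):

```bash
# list the MPI plugin types this SLURM installation provides;
# pmix (or a versioned pmix_vX) should appear for the setting above to work
srun --mpi=list
```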
As noted above, the only failing test is currently test_threads.jl, with the following error trace:
```
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: ault20
Framework: pml
--------------------------------------------------------------------------
[ault20.cscs.ch:762305] PML ucx cannot be selected
[ault20.cscs.ch:762306] PML ucx cannot be selected
[ault20.cscs.ch:762297] 1 more process has sent help message help-mca-base.txt / find-available:none found
[ault20.cscs.ch:762297] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[warn] Epoll MOD(1) on fd 25 failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor
test_threads.jl: Error During Test at /scratch/lraess/dev/MPI_master/test/runtests.jl:47
Got exception outside of a @test
failed process: Process(`mpiexec -n 2 /users/lraess/julia_local/julia-1.7.2/bin/julia -Cnative -J/users/lraess/julia_local/julia-1.7.2/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /scratch/lraess/dev/MPI_master/test/test_threads.jl`, ProcessExited(1)) [1]
Stacktrace:
[1] pipeline_error
@ ./process.jl:531 [inlined]
[2] run(::Cmd; wait::Bool)
@ Base ./process.jl:446
[3] run
@ ./process.jl:444 [inlined]
[4] (::var"#14#16"{Cmd, String})()
@ Main /scratch/lraess/dev/MPI_master/test/runtests.jl:53
[5] withenv(f::var"#14#16"{Cmd, String}, keyvals::Pair{String, String})
@ Base ./env.jl:172
[6] (::var"#13#15"{String})(cmd::Cmd)
@ Main /scratch/lraess/dev/MPI_master/test/runtests.jl:52
[7] mpiexec(f::var"#13#15"{String}; adjust_PATH::Bool, adjust_LIBPATH::Bool)
@ MPIPreferences.System ~/.julia/packages/MPIPreferences/uArzO/src/MPIPreferences.jl:37
[8] mpiexec(f::Function)
@ MPIPreferences.System ~/.julia/packages/MPIPreferences/uArzO/src/MPIPreferences.jl:37
[9] macro expansion
@ /scratch/lraess/dev/MPI_master/test/runtests.jl:48 [inlined]
[10] top-level scope
@ ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Test/src/Test.jl:1359
[11] include(fname::String)
@ Base.MainInclude ./client.jl:451
[12] top-level scope
@ none:6
[13] eval
@ ./boot.jl:373 [inlined]
[14] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:268
[15] _start()
@ Base ./client.jl:495
Test Summary: | Error Total
test_threads.jl | 1 1
ERROR: LoadError: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /scratch/lraess/dev/MPI_master/test/runtests.jl:47
caused by: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
ERROR: Package MPI errored during testing
Stacktrace:
[1] pkgerror(msg::String)
@ Pkg.Types ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/Types.jl:68
[2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
@ Pkg.Operations ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1672
[3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
@ Pkg.API ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/API.jl:421
[4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Pkg.API ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/API.jl:149
[5] test(pkgs::Vector{Pkg.Types.PackageSpec})
@ Pkg.API ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/API.jl:144
[6] test(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Pkg.API ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/API.jl:164
[7] test()
@ Pkg.API ~/julia_local/julia-1.7.2/share/julia/stdlib/v1.7/Pkg/src/API.jl:158
[8] top-level scope
@ none:1
```
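As for the remaining test_threads.jl failure: the `PML ucx cannot be selected` messages suggest Open MPI's UCX transport refuses to initialize for the threaded run. One experiment worth trying is steering Open MPI to a different PML via its standard MCA environment variables before rerunning (whether the ob1 fallback is built into this installation is an assumption):

```bash
# exclude the ucx PML and let Open MPI fall back to another component
export OMPI_MCA_pml=^ucx
# or pin ob1 explicitly, if available in this build:
# export OMPI_MCA_pml=ob1
julia --project -e 'using Pkg; Pkg.test()'
```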