UCX ignores UCX_ERROR_SIGNALS set by MPI.jl

Open vchuravy opened this issue 4 years ago • 1 comments

This manifests itself as a warning that the UCX_ERROR_SIGNALS variable is unused, and it leads to spurious aborts due to Julia's use of SIGSEGV:

[1595151936.606713] [node0022:65473:0]         parser.c:1491 UCX  WARN  unused env variable: UCX_ERROR_SIGNALS (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
...
[node0022:65475:1:65488] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3)
==== backtrace ====
    0  /home/software/spack/ucx/1.6.0-rxquvh64m7gt2oivvh4drm2rlquf4lf7/lib/libucs.so.0(+0x260f0) [0x20006df560f0]
    1  /home/software/spack/ucx/1.6.0-rxquvh64m7gt2oivvh4drm2rlquf4lf7/lib/libucs.so.0(+0x26520) [0x20006df56520]
    2  [0x2000000504d8]
    3  [0x2000ff2b4400]
    4  /home/software/julia/1.3.0/bin/../lib/libjulia.so.1(+0x211fc8) [0x200000281fc8]
    5  /home/software/julia/1.3.0/bin/../lib/libjulia.so.1(+0xc53e8) [0x2000001353e8]
    6  /home/software/julia/1.3.0/bin/../lib/libjulia.so.1(+0xc5974) [0x200000135974]

This is on MPI.jl 1.4.0 with OpenMPI 3.1.4 + UCX + pmi2. I suspect that the use of pmi2 is causing this: if I run export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" before srun --mpi=pmi2 julia, I get neither the unused env variable warning nor the spurious segfaults.
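The workaround described above can be sketched as a short launch script. This is only an illustration of the steps reported in this issue, not a confirmed fix: script.jl is a hypothetical placeholder for the MPI program, and the guard around srun is added only so the snippet is harmless on a machine without Slurm.

```shell
#!/bin/sh
# Set the variable in the job environment before launching, instead of
# relying on MPI.jl to set it at load time.
# SIGSEGV is deliberately left out of the list because Julia uses it internally.
export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"

# script.jl is a hypothetical placeholder for your MPI program.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmi2 julia script.jl
else
    echo "srun not found; UCX_ERROR_SIGNALS=$UCX_ERROR_SIGNALS"
fi
```

With the variable exported this way, the run reported here showed no "unused env variable" warning; whether that holds for other UCX versions or launchers is untested.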

vchuravy, Jul 19 '20 09:07

This is on MPI.jl 1.4.0 with OpenMPI 3.1.4 + UCX + pmi2

I assume you mean MPI.jl 0.14.0? I wonder if a newer UCX would work?

simonbyrne, Jul 20 '20 04:07