MPI implementations intercepting signals are incompatible with the Julia GC safepoint
Thanks again for your help with https://github.com/JuliaParallel/MPI.jl/issues/720 - this one is unrelated (except that issue #720 led us to create a more comprehensive unit test, revealing this new, probably unrelated segfault).
Summary of this problem: a segfault occurs when GC is triggered in a multithreaded+MPI context.
How to reproduce: I have created a draft PR adding a GC.gc() call to one of MPI.jl's existing multithreaded tests: see https://github.com/JuliaParallel/MPI.jl/pull/724
The draft PR is based off the most recent commit where all tests passed (tag v0.20.8). In the output of "test-intel-linux", the salient output is:
```
signal (11): Segmentation fault
in expression starting at /home/runner/work/MPI.jl/MPI.jl/test/test_threads.jl:18
ijl_gc_enable at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gc.c:2955
```
The change we made is in the file `test/test_threads.jl`, where we added the following `if` clause:
```julia
Threads.@threads for i = 1:N
    reqs[N+i] = MPI.Irecv!(@view(recv_arr[i:i]), comm; source=src, tag=i)
    reqs[i] = MPI.Isend(@view(send_arr[i:i]), comm; dest=dst, tag=i)
    if i == 1
        GC.gc()
    end
end
```
We experience similar problems with MPICH 4.0 in our package (https://github.com/Julia-Tempering/Pigeons.jl), but not with MPICH 4.1.
Related discussions
- https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
This describes a similar issue in the context of UCX. However, from our investigations so far, the problem does not seem limited to UCX.
- https://github.com/JuliaParallel/MPI.jl/issues/337
This describes a similar issue in the context of Open MPI. However, it seems that certain versions of MPICH and of Intel MPI (which is MPICH-derived) may suffer from a similar issue?
In light of these two sources, perhaps other environment variables in the style of https://github.com/JuliaParallel/MPI.jl/blob/6d513bb37da76182f2c31baff44f1f53a103295d/src/MPI.jl#L133 could be set to address this issue? I was wondering if anyone might have a suggestion on whether that's a reasonable hypothesis? Having limited MPI experience, I am not sure what these environment variables might be.
Thank you so much for your time.
In a multi-threaded environment, Julia uses segmentation faults on special addresses for its safepoint implementation. If the MPI implementation intercepts signals, this will cause spurious aborts.
UCX is a library that does this, so for a better experience we tell it not to. Generally, Julia will handle signals for the user.
That's right, @vchuravy, and the issue we are documenting here is that this problem is not limited to UCX: it affects other MPI implementations, in particular some that are currently in MPI.jl's set of test cases (see "test-intel-linux" in https://github.com/JuliaParallel/MPI.jl/pull/724, showing that MPI.jl with Intel MPI currently crashes when GC happens in a multithreaded context).
If you can figure out how to tell Intel MPI not to intercept signals we can add that as a vendor specific workaround.
We will do some research on that, thank you.
However, it seems a more principled approach would be to tell Julia to use another signal for GC coordination, since it appears that any situation where Julia runs under a signal-intercepting parent process would trigger a crash under GC+multithreading. This leads to a kind of Whac-A-Mole situation where the issue has to be addressed in all possible parent processes, some of which could potentially be closed source (like the situation here).
Also it looks like that issue was reported here: https://discourse.julialang.org/t/julia-crashes-inside-threads-with-mpi/52400/5
From a quick look there is no obvious ENV-based workaround for Intel MPI.
Add to the list of MPI systems incompatible with GC+multithreading: MPICH 4.0 (but not MPICH 4.1!).
> However it seems though a more principled approach would be to tell Julia to use another signal for GC coordination, since it seems that in any situation where Julia is used as a child process, GC+multithreading would trigger a crash
Let's be precise here: Julia does not crash; the MPI implementation is misreporting a signal as a crash.
The Julia GC safepoint needs to be very low-overhead and is implemented as a load from an address. When GC needs to be triggered, Julia sets the safepoint to "hot", i.e. it maps the page from which the load happens as inaccessible. The OS delivers a signal to the process, and Julia inspects the faulting address to ensure that the signal was caused by the safepoint.
While there are different alternatives one could implement, this method has the lowest overhead during execution of the program (and while I am interested in experimenting with different alternatives, I don't expect those experiments to bear fruit any time soon).
> some of which could potentially be closed source
I would encourage you to file a ticket with the vendor of the software.
Can you see which libfabric version the IntelMPI is using? There was a signal handler related bugfix that landed in v1.10.0rc1 (https://github.com/ofiwg/libfabric/pull/5613)
According to https://github.com/JuliaParallel/MPI.jl/blob/6d513bb37da76182f2c31baff44f1f53a103295d/.github/workflows/UnitTests.yml#L242
this particular failed test is on intelmpi-2019.9.304
@simonbyrne the latest is 2021.8.0; maybe worth an update?
Is that the same as oneAPI MPI? We already test that (thanks to @giordano)
@alexandrebouchard what version of Intel MPI are you using? And what is your libfabric version?
I am travelling this week, but let me get back to you on this soon!
Intel PSM also has the same issue, and requires the environment variable IPATH_NO_BACKTRACE to be set in order not to crash; this is undocumented, but can be seen in the source:
https://github.com/intel/psm/blob/e5b9f1cbf432161639cb5c51d17b196c92eb4278/ipath/ipath_debug.c#L162
Similar to UCX as documented here: https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
Also, Open MPI sets the same environment variable for a similar reason: https://docs.open-mpi.org/en/main/news/news-v2.x.html
> Change the behavior for handling certain signals when using PSM and PSM2 libraries. Previously, the PSM and PSM2 libraries would trap certain signals in order to generate tracebacks. The mechanism was found to cause issues with Open MPI's own error reporting mechanism. If not already set, Open MPI now sets the IPATH_NO_BACKTRACE and HFI_NO_BACKTRACE environment variables to disable PSM/PSM2's handling these signals.
https://github.com/open-mpi/ompi/blob/4216f3fc13079b80f64c07987935345189206064/opal/runtime/opal_init.c#L98-L115
```c
/* Very early in the init sequence -- before *ANY* MCA components are
   opened -- we need to disable some behavior from the PSM and PSM2
   libraries (by default): at least some old versions of these libraries
   hijack signal handlers during their library constructors and then do
   not un-hijack them when the libraries are unloaded.  It is a bit of
   an abstraction break that we have to put vendor/transport-specific
   code in the OPAL core, but we're out of options, unfortunately.

   NOTE: We only disable this behavior if the corresponding environment
   variables are not already set (i.e., if the user/environment has
   indicated a preference for this behavior, we won't override it). */
```
It doesn't look like setting IPATH_NO_BACKTRACE=1 is sufficient: #742 :disappointed: