MPI implementations intercepting signals are incompatible with the Julia GC safepoint
Thanks again for your help with https://github.com/JuliaParallel/MPI.jl/issues/720 - this one is unrelated (except that issue #720 led us to create a more comprehensive unit test, revealing this new, probably unrelated segfault).
Summary of this problem: a segfault occurs when GC is triggered in a multithreaded+MPI context.
How to reproduce: I have created a draft PR adding a GC.gc() call to one of MPI.jl's existing multithreaded tests: see https://github.com/JuliaParallel/MPI.jl/pull/724
The draft PR is based off the most recent commit where all tests passed (tag v0.20.8). In the output of "test-intel-linux", the salient output is:
```
signal (11): Segmentation fault
in expression starting at /home/runner/work/MPI.jl/MPI.jl/test/test_threads.jl:18
ijl_gc_enable at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gc.c:2955
```
The change we made is in the file `test/test_threads.jl`, where we added the following `if` clause:
```julia
Threads.@threads for i = 1:N
    reqs[N+i] = MPI.Irecv!(@view(recv_arr[i:i]), comm; source=src, tag=i)
    reqs[i] = MPI.Isend(@view(send_arr[i:i]), comm; dest=dst, tag=i)
    if i == 1
        GC.gc()
    end
end
```
We experience similar problems with MPICH 4.0 in our package (https://github.com/Julia-Tempering/Pigeons.jl), but not with MPICH 4.1.
Related discussions
- https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
This describes a similar issue in the context of UCX. However, from our investigations so far, the problem does not seem limited to UCX.
- https://github.com/JuliaParallel/MPI.jl/issues/337
This describes a similar issue in the context of Open MPI. However, it seems that certain versions of MPICH and of Intel MPI (which is MPICH-derived) may suffer from a similar issue?
In light of these two sources, perhaps other environment variables in the style of https://github.com/JuliaParallel/MPI.jl/blob/6d513bb37da76182f2c31baff44f1f53a103295d/src/MPI.jl#L133 could be set to address this issue? I was wondering if anyone might have a suggestion on whether that's a reasonable hypothesis? Having limited MPI experience, I am not sure what these environment variables might be.
Thank you so much for your time.
In a multi-threaded environment, Julia uses segmentation faults on special addresses for its safepoint implementation. If the MPI implementation intercepts signals, this will cause spurious aborts.
UCX is a library that does this, so for a better experience we tell it not to. Generally, Julia will handle signals for the user.
That's right, @vchuravy, and the issue we are documenting here is that this problem is not limited to UCX: it affects other MPI implementations, in particular some that are currently in MPI.jl's set of test cases (see "test-intel-linux" in https://github.com/JuliaParallel/MPI.jl/pull/724, showing that MPI.jl with Intel MPI currently crashes when GC happens in a multithreaded context).
If you can figure out how to tell Intel MPI not to intercept signals we can add that as a vendor specific workaround.
We will do some research on that, thank you.
However, it seems a more principled approach would be to tell Julia to use another signal for GC coordination, since it appears that any situation where Julia runs under a signal-intercepting parent process would trigger a crash under GC+multithreading. This leads to a kind of Whac-A-Mole situation where the issue has to be addressed in all possible parent processes, some of which could potentially be closed source (like the situation here).
Also it looks like that issue was reported here: https://discourse.julialang.org/t/julia-crashes-inside-threads-with-mpi/52400/5
From a quick look there is no obvious ENV-based workaround for Intel MPI.
Add to the list of MPI systems incompatible with GC+multithreading: MPICH 4.0 (but not MPICH 4.1!).
> However it seems though a more principled approach would be to tell Julia to use another signal for GC coordination, since it seems that in any situation where Julia is used as a child process, GC+multithreading would trigger a crash
Let's be precise here: Julia does not crash; the MPI implementation is misreporting a signal as a crash.
The Julia GC safepoint needs to be very low-overhead and is implemented as a load from an address. When GC needs to be triggered, Julia sets the safepoint to "hot", i.e. it maps the page from which the load happens as inaccessible. The OS delivers a signal to the process, and Julia inspects the faulting address to ensure that the signal was caused by the safepoint.
While there are different alternatives one could implement, this method has the lowest overhead during execution of the program (and while I am interested in experimenting with different alternatives, I don't expect those experiments to bear fruit any time soon).
> some of which could potentially be closed source
I would encourage you to file a ticket with the vendor of the software.
Can you see which libfabric version the IntelMPI is using? There was a signal handler related bugfix that landed in v1.10.0rc1 (https://github.com/ofiwg/libfabric/pull/5613)
According to https://github.com/JuliaParallel/MPI.jl/blob/6d513bb37da76182f2c31baff44f1f53a103295d/.github/workflows/UnitTests.yml#L242
this particular failed test is on intelmpi-2019.9.304
@simonbyrne the latest is 2021.8.0; maybe worth an update?
Is that the same as oneAPI MPI? We already test that (thanks to @giordano)
@alexandrebouchard what version of Intel MPI are you using? And what is your libfabric version?
I am travelling this week, but let me get back to you on this soon!
Intel PSM also has the same issue, and requires the environment variable IPATH_NO_BACKTRACE to be set in order not to crash; this is undocumented, but can be seen in the source:
https://github.com/intel/psm/blob/e5b9f1cbf432161639cb5c51d17b196c92eb4278/ipath/ipath_debug.c#L162
Similar to UCX as documented here: https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
Also, Open MPI sets the same environment variable for a similar reason: https://docs.open-mpi.org/en/main/news/news-v2.x.html
> Change the behavior for handling certain signals when using PSM and PSM2 libraries. Previously, the PSM and PSM2 libraries would trap certain signals in order to generate tracebacks. The mechanism was found to cause issues with Open MPI's own error reporting mechanism. If not already set, Open MPI now sets the IPATH_NO_BACKTRACE and HFI_NO_BACKTRACE environment variables to disable PSM/PSM2's handling these signals.
https://github.com/open-mpi/ompi/blob/4216f3fc13079b80f64c07987935345189206064/opal/runtime/opal_init.c#L98-L115
```c
/* Very early in the init sequence -- before *ANY* MCA components are
   opened -- we need to disable some behavior from the PSM and PSM2
   libraries (by default): at least some old versions of these libraries
   hijack signal handlers during their library constructors and then do
   not un-hijack them when the libraries are unloaded.  It is a bit of
   an abstraction break that we have to put vendor/transport-specific
   code in the OPAL core, but we're out of options, unfortunately.

   NOTE: We only disable this behavior if the corresponding environment
   variables are not already set (i.e., if the user/environment has
   indicated a preference for this behavior, we won't override it). */
```
It doesn't look like setting IPATH_NO_BACKTRACE=1 is sufficient: #742 :disappointed: