
Possible MPI Deadlock in GMG preconditioner

Open gassmoeller opened this issue 2 years ago • 5 comments

As discussed yesterday with @tjhei, we (@jdannberg, @RanpengLi and myself) are investigating a problem that looks like an MPI deadlock inside the GMG preconditioner. We are still working to break it down to a simple model (we currently need 64-128 processes and more than 10 hours of runtime to reproduce). I attach a log.txt and two stack traces of two different processes below.

Observations so far:

  1. The problem seems to be reproducible (running the same model stops in the same time step).
  2. The problem seems to occur consistently for all models with similar parameters. We have not checked significantly different models so far.
  3. Restarting a stopped model allows us to continue running further than the previous stop point.
  4. Switching from the GMG preconditioner to the AMG preconditioner resolves the issue (models that would always stop with GMG run without deadlock with AMG).
  5. We can successfully run the same model on different hardware, compiler, and MPI versions, while using the same ASPECT and deal.II. Environment that crashes: Intel 19.1.0.166 / openmpi 4.1.1. Environment that works: GCC 9.4.0 / openmpi 4.0.3.

Analyzing the stacktrace shows the following:

  1. Both processes are stuck within AffineConstraints::distribute, when creating a new distributed vector and calling Partitioner::set_ghost_indices, which calls ConsensusAlgorithms::NBX::run (just a simplified summary of the stacktrace).
  2. However, the two processes are stuck in different places that should not be simultaneously reachable:
  • One is stuck in the MPI_Barrier in the destructor of the ScopedLock created here. This suggests to me that this process is done with the function and is in the process of returning to the calling function.
  • The other is stuck in MPI_Test, and the only MPI_Test I found in the algorithm is inside all_remotely_originated_receives_are_completed() here.
  • However, the MPI_Test in question checks whether all processes have completed the MPI_IBarrier that is placed further up in the function. So the first process must have passed this test already, while the second one is stuck. In other words, it looks like the return value of the MPI_Test makes some processes believe everyone has passed the MPI_IBarrier, while some others are never notified that the MPI_IBarrier has been passed by every process (and therefore wait endlessly for completion). I am not sure how this can happen (is it possible to get here if one process throws an exception inside the function?). A minimal sketch of this IBarrier/MPI_Test pattern follows after this list.
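
To make the termination step we are talking about concrete, here is a minimal, hypothetical sketch of the IBarrier/MPI_Test pattern (the function and variable names are made up; this is not the actual deal.II ConsensusAlgorithms::NBX code):

#include <mpi.h>

// Hypothetical sketch of the NBX termination step discussed above (not the
// deal.II implementation): once a rank has posted all of its sends, it enters
// a non-blocking barrier and then keeps polling the barrier request with
// MPI_Test while continuing to service incoming messages. The observed
// symptom is that MPI_Test reports completion on some ranks but never on
// others, which should be impossible if every rank reached the MPI_Ibarrier.
void nbx_termination_sketch(MPI_Comm comm)
{
  MPI_Request barrier_request;
  MPI_Ibarrier(comm, &barrier_request); // "I am done sending"

  int everyone_done = 0;
  while (everyone_done == 0)
    {
      // ... probe for and answer any remaining incoming requests here ...

      MPI_Test(&barrier_request, &everyone_done, MPI_STATUS_IGNORE);
    }
}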

Things we test at the moment:

  • We are in the process of running the model again on the failing hardware with Intel 19.1.0.166 / openmpi 4.0.3 to see if something in openmpi 4.1.1 is causing the issue. Afterwards we will look into using GCC instead of the Intel compiler.
  • We will try running the failing model with the latest deal.II instead of v9.4.0.

Other ideas to test are appreciated. I will update this issue when we find out more.

test.48589940.log.txt 126267.txt 126279.txt

gassmoeller avatar Oct 13 '22 18:10 gassmoeller

Reopening. We do not know yet if #4986 fixes this issue. Github automation was a bit overzealous.

gassmoeller avatar Oct 15 '22 13:10 gassmoeller

I have no suggestion at this point (I agree with all of your points). Maybe we are having a problem with an exception.

Maybe @peterrum has an idea...

tjhei avatar Oct 17 '22 02:10 tjhei

Additional information today:

  • Running the model that fails on our cluster (with openmpi 4.1.1) again with openmpi 4.0.3 (the version that succeeds on our workstation) still fails. Thus, it is likely not related to the MPI version. It could still be that MPI is compiled with different options, and of course the hardware differs between the cluster and the workstation.
  • We have confirmed that the process that is stuck in the destructor of the ScopedLock is processing an exception. In other words, because the destructor of the ScopedLock requires MPI communication, we run into the deadlock. We likely need guards for that similar to the ones used here (a sketch of the kind of guard I mean follows after this list). I have a change ready that we are currently testing and that I will turn into a PR for deal.II when the tests are successful.
  • This does not yet tell us why the exception is thrown; it only removes the deadlock.
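
To illustrate the kind of guard I mean, here is a hypothetical, standalone sketch (this is not the actual deal.II CollectiveMutex::ScopedLock; the class and names are made up, and it assumes C++17 for std::uncaught_exceptions()):

#include <mpi.h>

#include <exception>
#include <iostream>

// Hypothetical sketch: a scoped lock whose destructor normally performs a
// collective MPI_Barrier, but skips it when the destructor runs as part of
// stack unwinding after an exception. Without such a guard, the rank that
// threw waits in the barrier for ranks that will never reach it -- the
// deadlock we observed.
class ScopedLockSketch
{
public:
  explicit ScopedLockSketch(MPI_Comm comm)
    : comm(comm)
  {}

  ~ScopedLockSketch()
  {
    if (std::uncaught_exceptions() > 0)
      {
        // An exception is in flight: do not start collective communication
        // from the destructor, just give up the lock locally.
        std::cerr << "ScopedLockSketch: skipping MPI_Barrier during "
                     "exception propagation" << std::endl;
        return;
      }

    MPI_Barrier(comm); // normal, exception-free path
  }

private:
  MPI_Comm comm;
};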

gassmoeller avatar Oct 17 '22 16:10 gassmoeller

  • We likely need guards for that similar to the ones used here.

Agreed. That is what I would do as well. Let me know if you need help.

tjhei avatar Oct 17 '22 18:10 tjhei

This does not yet tell us why the exception is thrown; it only removes the deadlock.

Any idea where the exception is thrown? I guess that if the collective mutex causes the deadlock, this means that the content guarded by the mutex causes a problem!?

peterrum avatar Oct 17 '22 19:10 peterrum

Any idea where the exception is thrown?

My current best guess is somewhere between here and here (lines 1606-1628). Some processes clearly reach the MPI_IBarrier inside signal_finish, while others do not (or at least some processes are later waiting for someone to reach the IBarrier).

Because of all the characteristics above I suspect a resource leak (reproducible at a fixed timestep, but when restarting we can progress past the crashed timestep and get stuck later).

I will let you know once I have some more information; we are currently testing dealii/dealii#14356 to see if it improves our error message.

gassmoeller avatar Oct 17 '22 20:10 gassmoeller

Using dealii/dealii#14364 we made some progress in tracking down the problem. Still working on digging deeper, but here is what we know now:

The one exception we have tracked down (there seems to be another one that still happens) is a boost exception that is raised from inside Utilities::unpack in this line.

The exception we saw is:

----------------------------------------------------
Exception on rank 20 on processing:
no read access: iostream error
Aborting!
----------------------------------------------------

Which is raised from inside Boost. @RanpengLi and I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>. It looked like both cbegin and cend, the input arguments to unpack, are null pointers. I added a fix to Utilities::unpack to just return a default-constructed object if cbegin == cend (which should be safe even with null pointers). This seems to have fixed most exceptions, but at least one process is still throwing an exception, and we are currently tracking that down.
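
For illustration, here is roughly what the guard does, as a standalone sketch using plain boost serialization (this is not the actual deal.II Utilities::unpack, which among other things also handles compressed buffers):

#include <boost/archive/binary_iarchive.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>

#include <cstddef>

// Standalone sketch of the guard (not the deal.II code itself): if the
// incoming character range is empty, return a default-constructed object
// instead of handing the (possibly null) pointers to the boost archive,
// which otherwise throws the "no read access: iostream error" seen above.
template <typename T>
T unpack_sketch(const char *cbegin, const char *cend)
{
  if (cbegin == cend)
    return T(); // nothing to deserialize; safe even if both pointers are null

  boost::iostreams::stream<boost::iostreams::array_source> in(
    cbegin, static_cast<std::size_t>(cend - cbegin));
  boost::archive::binary_iarchive archive(in);

  T object;
  archive >> object;
  return object;
}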

I attach the backtrace in case someone has an idea. exception_backtrace.txt

Currently open questions:

  • I am not sure if the bug here is that an empty message (instead of actual data) is being sent and unpacked, or if we simply do not handle empty messages well. I also don't know why this is a problem on our cluster but not on our workstation, despite identical gcc and mpi versions. They even use nearly the same hardware (AMD EPYC), so it shouldn't be an optimization flag issue.
  • It looks like boost may not like the initialization in this line if cbegin and/or cend are null. Should we do something about that? What should unpack do if there is no data to unpack? Return a default-constructed empty object? Or crash?
  • We are still investigating why even with the fix there is at least one process that still throws. Will report back once I know.

gassmoeller avatar Nov 01 '22 20:11 gassmoeller

I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>. It looks like boost may not like the initialization in this line

Yes, de-referencing an empty range is undefined behavior. I will play with this a little bit and report back.

tjhei avatar Nov 03 '22 13:11 tjhei

I tried compressing an empty std::vector<std::pair>, but it compresses to 53 bytes, not to 0 bytes (see the sketch below for the kind of check I mean). So to me it sounds like receiving an empty buffer should be considered a bug.
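
A self-contained way to check this (note: plain boost serialization into a string stream, without the compression that deal.II's Utilities::pack can apply, so the exact byte count will differ from the 53 bytes above):

#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/utility.hpp> // serialization of std::pair
#include <boost/serialization/vector.hpp>  // serialization of std::vector

#include <iostream>
#include <sstream>
#include <utility>
#include <vector>

// Serialize an empty std::vector<std::pair<unsigned int, unsigned int>> and
// print the size of the resulting buffer: it is non-zero (archive header plus
// vector bookkeeping), so a received buffer of length zero cannot be a
// legitimately packed object.
int main()
{
  const std::vector<std::pair<unsigned int, unsigned int>> empty;

  std::ostringstream out;
  {
    boost::archive::binary_oarchive archive(out);
    archive << empty;
  } // destroying the archive flushes everything into 'out'

  std::cout << "serialized size of an empty vector: " << out.str().size()
            << " bytes" << std::endl;
  return 0;
}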

tjhei avatar Nov 03 '22 14:11 tjhei

@gassmoeller I'm not sure this is fixed. Do we need to keep this open here? Is there anything we can/need to do in ASPECT?

bangerth avatar Jul 13 '23 23:07 bangerth

I do not know if it is fixed, but @RanpengLi reported that after a system update our cluster doesn't show the problem anymore, so it might have been a configuration problem (or an interaction of the library versions we used). We can close the issue for now.

gassmoeller avatar Jul 14 '23 03:07 gassmoeller