Reduction to Band deadlocks in GEP

Open albestro opened this issue 2 years ago • 1 comments

While working on #704 I stumbled upon a deadlock problem with test_gen_eigensolver.

Here are some results from the investigation of the problem.

Test-cases and configurations

Apparently, of all the test cases currently tested:

https://github.com/eth-cscs/DLA-Future/blob/1332ce2e576772cd537bb7359d9a7dd3eed5d0fd/test/unit/eigensolver/test_gen_eigensolver.cpp#L49-L53

deadlocks occur only with the following ones:

  • (m=16 mb=10)
  • (m=34 mb=13)

For debugging purposes, these test-cases have been run with a 2x1 grid (tested with col-major), and the problem (hopefully the same) is still reproducible.

The problem is deterministically reproducible with --pika:threads=2. With more threads it is no longer deterministic, but it can still be reproduced (less often, yet still frequently).

Problem investigation

Looking at the backtrace, there seem to be a recvBcast and a recv stuck on two different ranks, but it was not possible to establish whether this is the only case. A sweepRecv may also have been listed in the backtrace once.

The speculation is that a lot of tasks get scheduled, so it is not straightforward to tell which one is causing the deadlock.

Debugging attempts

A series of investigations has been carried out. The major take-home points are:

  1. Adding a wait (for all scheduled tasks to finish) between gen2std and red2band makes everything work correctly.
  2. a) Commenting out the cholesky call in the GEP partially improves the situation, but the run does not always reach the end (i.e. it deadlocks in the 2nd iteration instead of the first one).
     b) Commenting out the gen2std call in the GEP lets the run reach the end without deadlocks.
  3. Commenting out the red2band blocking communication (2 x allReduce in the panel) lets the run reach the end without deadlocks.
  4. The red2band tests also pass with the problematic configurations from gen_eigensolver.
  5. Running them on a 2x1 grid creates a deadlock, while on a 1x2 grid it does not.

Conclusions

This is the "first time" that the distributed red2band has something scheduled before it (in the tests there is a wait for all tasks just after the call, which should be equivalent to having nothing scheduled between calls). Apparently cholesky is not problematic (or maybe just less so), but gen2std is.

For these reasons, together with @rasolca, we speculate that the problem lies in the interoperability between gen2std and red2band. In particular, since 1x2 does not deadlock but 2x1 does, it might again be related to the blocking communications in red2band.
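To make the speculation concrete, here is a toy model of how blocking communications can starve a small worker pool. It is only a sketch of the general mechanism, not DLA-Future or pika code, and it is not a confirmed diagnosis of this issue. It assumes that each blocking communication occupies a worker thread until every rank has entered the same communication, and that ranks may start them in different orders. The names `Rank` and `completes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

// Toy model (assumption, not DLA-Future code): each rank has a FIFO queue of
// blocking collective ids (assumed unique per rank) and a fixed number of workers.
struct Rank {
  std::deque<int> queue;  // collectives in the order this rank scheduled them
  std::set<int> blocked;  // collectives currently occupying a worker
};

// Returns true if every queued collective completes, false if the ranks get stuck.
bool completes(std::vector<Rank> ranks, std::size_t workers_per_rank) {
  assert(!ranks.empty());
  for (;;) {
    bool progress = false;

    // 1) Idle workers pick up the next queued collectives and block in them.
    for (auto& r : ranks) {
      while (r.blocked.size() < workers_per_rank && !r.queue.empty()) {
        r.blocked.insert(r.queue.front());
        r.queue.pop_front();
        progress = true;
      }
    }

    // 2) A collective finishes once every rank has a worker blocked inside it.
    const std::set<int> candidates = ranks.front().blocked;  // copy: we erase below
    for (int id : candidates) {
      bool everywhere = true;
      for (const auto& r : ranks)
        everywhere = everywhere && r.blocked.count(id) > 0;
      if (everywhere) {
        for (auto& r : ranks) r.blocked.erase(id);
        progress = true;
      }
    }

    bool done = true;
    for (const auto& r : ranks)
      done = done && r.queue.empty() && r.blocked.empty();
    if (done) return true;
    if (!progress) return false;  // all workers blocked, nothing can finish
  }
}
```

In this model, two ranks that schedule collectives 1 and 2 in opposite orders get stuck with one worker per rank and complete with two. That is at least consistent with the deadlock becoming less frequent with more threads. The model says nothing about why 2x1 and 1x2 grids behave differently.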

albestro, Nov 30 '22 14:11

Might this be due to the communicator cloning workaround we currently have in each algorithm implementation?

rasolca, Jan 04 '23 14:01