DLA-Future
Reduction to Band deadlocks in GEP
While working on #704 I stumbled upon a deadlock in `test_gen_eigensolver`.
Here I report the results of the investigation so far.
Test-cases and configurations
Apparently, of all the test cases currently run:
https://github.com/eth-cscs/DLA-Future/blob/1332ce2e576772cd537bb7359d9a7dd3eed5d0fd/test/unit/eigensolver/test_gen_eigensolver.cpp#L49-L53
deadlocks happen only with the following ones:
- (m=16 mb=10)
- (m=34 mb=13)
For debugging purposes, these test cases were run on a 2x1 grid (tested with col-major ordering), and the problem (hopefully the same one) is still reproducible.
The problem reproduces deterministically with `--pika:threads=2`; with more threads it is no longer deterministic, but it still reproduces frequently, though less often.
Problem investigation
The backtrace seems to show a `recvBcast` and a `recv` stuck on two different ranks, but it was not possible to determine whether this is the only case. Possibly, a `sweepRecv` also appeared in the backtrace once.
The speculation is that a lot of tasks get scheduled, so it is not straightforward to tell which one is causing the deadlock.
Debugging attempts
A series of investigations has been carried out. The major take-home points are:
- adding a wait (for all scheduled tasks to finish) between `gen2std` and `red2band` makes everything work correctly
- a) commenting out the `cholesky` call in the GEP partially improves the situation, but the run does not always reach the end (i.e. it deadlocks in the 2nd iteration instead of the first one)
- b) commenting out the `gen2std` call in the GEP lets the run reach the end without deadlocks
- commenting out the `red2band` blocking communication (2 x allReduce in the panel) lets the run reach the end without deadlocks
- `red2band` passes its tests also with the problematic configurations from `gen_eigensolver`
- running them on a 2x1 grid creates a deadlock, while on a 1x2 grid it doesn't.
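One way the observations above could fit together (the dependence on the thread count and on the blocking communication) is sketched below. This is a toy model, not DLA-Future or pika code: the function `completes`, the tags "A" and "B", and the notion of "workers" are all hypothetical. It assumes that each blocking operation occupies one worker until the other rank has entered the operation with the same tag, and that the two ranks may pick up their pending operations in different orders.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model (hypothetical, not DLA-Future code): two ranks, each with a fixed
// number of workers. Every entry in order0/order1 is a blocking operation
// identified by a unique tag; it finishes only once BOTH ranks have a worker
// blocked inside the operation with that tag.
// Returns true if all operations finish, false if the ranks get stuck.
bool completes(const std::vector<std::string>& order0,
               const std::vector<std::string>& order1,
               std::size_t workers) {
  if (workers == 0)
    return order0.empty() && order1.empty();

  std::size_t next0 = 0, next1 = 0;
  std::set<std::string> inflight0, inflight1;

  while (true) {
    // Each free worker picks up the next pending blocking operation.
    while (inflight0.size() < workers && next0 < order0.size())
      inflight0.insert(order0[next0++]);
    while (inflight1.size() < workers && next1 < order1.size())
      inflight1.insert(order1[next1++]);

    if (inflight0.empty() && inflight1.empty())
      return true;  // nothing left to run on either rank

    // Operations that both ranks have entered can complete.
    std::vector<std::string> matched;
    for (const auto& tag : inflight0)
      if (inflight1.count(tag) != 0)
        matched.push_back(tag);

    if (matched.empty())
      return false;  // all busy workers wait on an unmatched partner: deadlock

    for (const auto& tag : matched) {
      inflight0.erase(tag);
      inflight1.erase(tag);
    }
  }
}
```

In this model, `completes({"A", "B"}, {"B", "A"}, 1)` gets stuck, while the same orders with 2 workers, or identical orders with 1 worker, run to completion. That mirrors the reported behaviour only qualitatively; whether the real deadlock has this shape is exactly what remains to be verified.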
Conclusions
This is the "first time" that distributed `red2band` has something scheduled before it (in the tests there is a wait for all right after the call, which should be equivalent to having nothing scheduled between calls). Apparently `cholesky` is not problematic (or maybe just less so), but `gen2std` is.
For these reasons, together with @rasolca we speculate that the problem lies in the interoperability between `gen2std` and `red2band`. In particular, since 1x2 does not deadlock but 2x1 does, it might again be related to the blocking communications in `red2band`.
Might this be due to the communicator cloning workaround we currently have in each algorithm implementation?
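To make the 2x1 versus 1x2 observation concrete, here is a minimal sketch of how two ranks map onto the grid with col-major ordering. The names `GridSize`, `rankToCoords`, `colCommSize` and `rowCommSize` are hypothetical and are not the DLA-Future API. The sketch also rests on an unconfirmed assumption: that the two allReduce calls in the `red2band` panel run over the ranks of one grid column.

```cpp
#include <utility>

// Hypothetical helper types, for illustration only.
struct GridSize {
  int rows;
  int cols;
};

// Col-major ordering: consecutive ranks fill a grid column from top to bottom
// before moving to the next column. Returns (row, col).
std::pair<int, int> rankToCoords(int rank, GridSize grid) {
  return {rank % grid.rows, rank / grid.rows};
}

// Number of ranks that share a grid column (one rank per grid row).
int colCommSize(GridSize grid) {
  return grid.rows;
}

// Number of ranks that share a grid row (one rank per grid column).
int rowCommSize(GridSize grid) {
  return grid.cols;
}
```

With a 2x1 grid both ranks sit in the same column, so a column-wise allReduce needs both of them to take part. With a 1x2 grid each column holds a single rank, so under the assumption above the same allReduce would involve no other rank and could not block on a partner.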