Reductions in CkIO::WriteSession not completing
This is with Charm version 7.0.0 and ChaNGa. I am running on Frontera with the MPI build of Charm on 40 nodes.
I am occasionally getting hangs during the use of Ck::IO. Sticking in CkPrintf()s narrows the hang down to two different places (a usage-level sketch of the Ck::IO flow follows this list):
- sessionReady(), which is set as an InitCallback, will sometimes not get called after the CProxy_WriteSession::ckNew().
- (More commonly) the reduction triggered by contribute at the end of WriteSession::forwardData() never completes, and WriteSession::syncData() is never called.
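For context, the caller-side flow looks roughly like the sketch below, based on the documented Ck::IO interface rather than ChaNGa's actual output code; the chare `Main`, the entry methods `fileOpened`/`sessionReady`/`sessionDone`, and the byte counts are placeholders. Hang 1 corresponds to the session-ready callback never firing; hang 2 corresponds to the internal reduction that should eventually lead to the session-complete callback.

```cpp
#include "ckio.h"  // Ck::IO interface (assumes the CkIO module is declared in the .ci file)

// Sketch only: Main, fileOpened, sessionReady, sessionDone, and the sizes
// below are placeholders, not ChaNGa's actual code.
void Main::startOutput() {
  Ck::IO::Options opts;
  Ck::IO::open("output.dat",
               CkCallback(CkIndex_Main::fileOpened(NULL), thisProxy), opts);
}

void Main::fileOpened(Ck::IO::FileReadyMsg *m) {
  size_t totalBytes = 1024 * 1024;  // placeholder for the session's total size
  // Internally this creates the WriteSession array (CProxy_WriteSession::ckNew);
  // the "ready" callback below is the one that sometimes never fires (hang 1).
  Ck::IO::startSession(m->file, totalBytes, 0,
                       CkCallback(CkIndex_Main::sessionReady(NULL), thisProxy),
                       CkCallback(CkIndex_Main::sessionDone(NULL), thisProxy));
  delete m;
}

void Main::sessionReady(Ck::IO::SessionReadyMsg *m) {
  const char *myBuffer = "...";      // placeholder for this contributor's data
  size_t myBytes = 3, myOffset = 0;  // placeholder size and offset
  // Contributors call Ck::IO::write(); the WriteSession elements forward the
  // data (forwardData) and then contribute to the reduction that should run
  // syncData() and, eventually, the "complete" callback (hang 2 is that
  // reduction never finishing).
  Ck::IO::write(m->session, myBuffer, myBytes, myOffset);
  delete m;
}
```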
At the moment I'm reproducing this from a checkpoint. I'm checking now if I can reproduce it from a non-checkpoint restart.
If someone has an account on frontera, I can point them at the simulation.
~~Update: I have reproduced this problem (symptom 2 in particular) with a non-checkpoint start.~~ That was a script mistake. At the moment I can ONLY reproduce this problem when restarting from a checkpoint.
My current guess is that this hang is due to bookkeeping problems around the creation and destruction of the WriteSession arrays. In particular, the number of array elements is much less than the number of PEs, so there are a lot of inactive PEs to keep track of during reductions. Furthermore, these creations/reductions/destructions are all happening immediately before ChaNGa writes a checkpoint, so I'm worried about bookkeeping across the checkpoint.
So one thing I tried: do a CkWaitQD() just before calling CkStartCheckpoint(). The hang goes away (so far) with this change.
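Concretely, the change is along these lines. This is a minimal sketch, not ChaNGa's actual checkpoint code: the method names and directory name are placeholders, and it assumes the caller is a [threaded] entry method so that CkWaitQD() is allowed to block.

```cpp
// Sketch of the workaround; doCheckpoint and checkpointDone are placeholder
// names. Must run inside a [threaded] entry method so CkWaitQD() can block.
void Main::doCheckpoint() {
  // Wait for quiescence so any outstanding WriteSession creation/reduction
  // messages are delivered before the checkpoint machinery starts.
  CkWaitQD();
  CkCallback resume(CkIndex_Main::checkpointDone(), thisProxy);
  CkStartCheckpoint("checkpoint_dir", resume);  // placeholder directory name
}
```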
Unfortunately, I'm still getting the hang after #3765 was merged.
Update: when I get to a point where CkIO hangs, I can get through the problematic CkIO session by restarting from the last checkpoint with ++ppn 1. Furthermore, once the problematic output is done and a new checkpoint is written, I can continue from the new checkpoint with the usual ++ppn number.
My guess is that this problem is caused by a race condition.
I can further confirm that @trquinn's workaround does get past the hang for me as well.