Reductions in CkIO::WriteSession not completing

Open trquinn opened this issue 2 years ago • 5 comments

This is with Charm++ version 7.0.0 and ChaNGa. I am running on Frontera with the MPI build of Charm++ on 40 nodes.

I am occasionally getting hangs when using Ck::IO. Adding CkPrintf() calls narrows the hang down to two different places:

  1. sessionReady(), which is set as an InitCallback, will sometimes not get called after the CProxy_WriteSession::ckNew().
  2. (More commonly) the reduction triggered by contribute() at the end of WriteSession::forwardData() never completes, and WriteSession::syncData() is never called.

At the moment I'm reproducing this from a checkpoint. I'm checking now if I can reproduce it from a non-checkpoint restart.

If someone has an account on frontera, I can point them at the simulation.

trquinn avatar Dec 08 '22 00:12 trquinn

~~Update: I have reproduced this problem (symptom 2 in particular) with a non-checkpoint start.~~ Made a script mistake. At the moment I can ONLY reproduce this problem when restarting from a checkpoint.

trquinn avatar Dec 08 '22 06:12 trquinn

My current guess is that this hang is due to bookkeeping problems around the creation and destruction of the WriteSession arrays. In particular, the number of array elements is much less than the number of PEs, so there are a lot of inactive PEs to keep track of during reductions. Furthermore, these creations/reductions/destructions are all happening immediately before ChaNGa writes a checkpoint, so I'm worried about bookkeeping across the checkpoint. So one thing I tried: do a CkWaitQD() just before calling CkStartCheckpoint(). The hang goes away (so far) with this change.
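
To illustrate the ordering in that workaround, here is a toy model in plain C++. It is not Charm++ code and not a diagnosis of the actual bug. It assumes, purely for illustration, that contributions still in flight when a checkpoint is taken are not captured by it. `ToyRuntime`, `waitForQuiescence()` and `checkpointAndRestart()` are hypothetical names; the first two only stand in for the roles of the runtime and `CkWaitQD()`, and the last for `CkStartCheckpoint()` followed by a restart.

```cpp
#include <deque>

// Toy model only: a "runtime" whose reduction completes once every
// expected contribution has been delivered.
struct ToyRuntime {
    int expected;              // contributions the reduction needs
    int delivered = 0;         // contributions that have reached the root
    std::deque<int> inFlight;  // contributions sent but not yet delivered

    explicit ToyRuntime(int n) : expected(n) {}

    // An element contributes; the message is queued, not delivered yet.
    void contribute(int pe) { inFlight.push_back(pe); }

    // Deliver one pending contribution, if there is one.
    bool deliverOne() {
        if (inFlight.empty()) return false;
        inFlight.pop_front();
        ++delivered;
        return true;
    }

    // Hypothetical stand-in for CkWaitQD(): run until nothing is in flight.
    void waitForQuiescence() {
        while (deliverOne()) {}
    }

    // Hypothetical stand-in for checkpoint + restart. ASSUMPTION of this
    // toy: only delivered state is saved; in-flight messages are dropped.
    ToyRuntime checkpointAndRestart() const {
        ToyRuntime restored(expected);
        restored.delivered = delivered;
        return restored;
    }

    bool reductionComplete() const { return delivered == expected; }
};

// Returns true if the reduction completes after the restart.
bool runScenario(bool drainFirst) {
    ToyRuntime rt(4);
    for (int pe = 0; pe < 4; ++pe) rt.contribute(pe);
    rt.deliverOne();                         // only part of the reduction has arrived
    if (drainFirst) rt.waitForQuiescence();  // the workaround's ordering
    ToyRuntime restored = rt.checkpointAndRestart();
    restored.waitForQuiescence();            // let the restarted run proceed
    return restored.reductionComplete();
}
```

In this model `runScenario(false)` leaves the reduction incomplete after the restart, while `runScenario(true)` lets it finish. Whether the real hang follows this mechanism is not established in this thread.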

trquinn avatar Dec 27 '22 17:12 trquinn

Unfortunately, I'm still getting the hang after #3765 was merged.

trquinn avatar Jan 18 '24 05:01 trquinn

Update: when I get to a point where CkIO hangs, I can get through the problematic CkIO session by restarting from the last checkpoint with ++ppn 1. Furthermore, once the problematic output is done and a new checkpoint is written, I can continue from the new checkpoint with the usual ++ppn number.

My guess is that this problem is caused by a race condition.

trquinn avatar Jul 14 '24 04:07 trquinn

I can further confirm that @trquinn's workaround does get past the hang for me as well.

bwkeller avatar Jul 24 '24 19:07 bwkeller