Thomas Quinn
I haven't tried smaller runs yet. It fails on pretty much every run, but the failure mode varies from run to run. Also, it takes around 24 iterations...
You are correct: I used "git bisect" starting with 7.0.0 to find this commit.
The machine is NASA's Pleiades (https://www.nas.nasa.gov/hecc/resources/pleiades.html). We use the verbs-linux-x86_64 smp machine layer.
UCX has its own problems with ChaNGa (running out of registered memory segments).
I should also note: the reason we are using verbs on this machine is that the MPI implementation is UCX based and therefore fails.
I've reproduced problems with this commit on Frontera using the mpi-linux-x86_64 smp build. I can point people to the failing ICs, etc., on request.
It consistently fails the assertion at line 997 of ckarray.h: CkAssert(storage[(headIndex + offset) & mask] == nullptr);
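For anyone reading along without the source open: the index expression in that assertion reads like a lookup into a power-of-two circular buffer, where `mask` is the capacity minus one and a `nullptr` slot means "free". Below is a minimal sketch under that assumption. It is not the Charm++ code: `SlotWindow`, `put`, and `popHead` are hypothetical names, and only `storage`, `headIndex`, `offset`, and `mask` come from the assertion itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; NOT the Charm++ implementation.
// Assumes a power-of-two circular buffer in which nullptr marks a free slot.
struct SlotWindow {
    std::vector<const int*> storage;  // nullptr == slot is free
    std::size_t mask;                 // capacity - 1 (capacity is a power of two)
    std::size_t headIndex = 0;        // slot holding the oldest outstanding item

    explicit SlotWindow(std::size_t capacity)
        : storage(capacity, nullptr), mask(capacity - 1) {
        // The "& mask" trick only equals "% capacity" for powers of two.
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    // Try to store an item `offset` positions past the head.
    // Returns false in the situation where an assertion of the form
    //   storage[(headIndex + offset) & mask] == nullptr
    // would fire: the target slot is already occupied.
    bool put(std::size_t offset, const int* item) {
        const std::size_t slot = (headIndex + offset) & mask;
        if (storage[slot] != nullptr) {
            return false;  // duplicate offset, or offset wrapped past capacity
        }
        storage[slot] = item;
        return true;
    }

    // Remove and return the item at the head (may be nullptr), then advance.
    const int* popHead() {
        const int* item = storage[headIndex];
        storage[headIndex] = nullptr;
        headIndex = (headIndex + 1) & mask;
        return item;
    }
};
```

In this sketch, the check fails in two ways: the same offset is written twice, or an offset is at least the capacity ahead of the head, so it wraps onto a slot that has not been drained yet. Whether either corresponds to what happens in ckarray.h is a guess.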
This has been fixed by PR #3681.
~~Update: I have reproduced this problem (symptom 2 in particular) with a non-checkpoint start.~~ Correction: I made a mistake in my script. At the moment I can ONLY reproduce this problem when restarting from...
My current guess is that this hang is due to bookkeeping problems around the creation and destruction of the WriteSession arrays. In particular, the number of array elements is much...