commit 5e26e07023 causing crashes in ChaNGa
The "Order array broadcast delivery with send epoch" seems to be causing memory corruption problems in ChaNGa. On of the errors looks like this: `Processor 375 Exiting: Called CmiAbort ------------ Reason: nbor_msgs_count_ <= 0 so may be not set
[375] Stack Traceback: [375:0] ChaNGa.smp 0x8e240b CmiAsyncBroadcastFn [375:1] ChaNGa.smp 0x5b4987 TreePiece::recvBoundary(unsigned __int128, NborDir) [375:2] ChaNGa.smp 0x62b41f CkIndex_TreePiece::_call_recvBoundary_marshall96(void*, void*) [375:3] ChaNGa.smp 0x791067 CkDeliverMessageFree [375:4] ChaNGa.smp 0x7b7e51 CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) `
This was implemented in PR #3614.
(Just commenting this so there's a link to the PR.)
Are there other errors that this is causing? Is there a good way of reproducing this issue?
Reproduction: so far I've only seen this on a 128x20 core run with an 8 GB input dataset. I'll try on something smaller. Other symptoms include getting zeros for arguments in entry methods that are targets of broadcasts. One example is `TreePiece::updateuDot()`, whose second argument is an array of 32 doubles. At least one of the doubles will occasionally be zero when it should not be.
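To catch this closer to the source, a temporary tripwire at the top of the affected entry methods might help. The sketch below is purely illustrative: the helper and the parameter name it mentions are made up rather than taken from ChaNGa, and whether a zero is actually invalid depends on the context, so any abort it triggers is only a hint.

```cpp
#include "charm++.h"

// Hypothetical debugging helper (not actual ChaNGa code): flag suspicious
// zeros in a broadcast-delivered array argument so corruption is caught at
// the receiving entry method instead of much later in the timestep.
static void checkBroadcastArgs(const double *vals, int n, const char *where) {
  for (int i = 0; i < n; ++i) {
    if (vals[i] == 0.0) {
      CkPrintf("[%d] %s: argument %d is unexpectedly zero\n",
               CkMyPe(), where, i);
      CkAbort("possible corrupted broadcast argument");
    }
  }
}

// Usage sketch: call checkBroadcastArgs(duDeltas, 32, "updateuDot") at the
// top of TreePiece::updateuDot(); "duDeltas" is a placeholder parameter name.
```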
Any more luck on reproducing this at a smaller scale? How frequently does this happen: every run? How long does it take to happen: does it fail on the first iteration or deep into the execution, and is that consistent? (Also, what machine and machine layer did you observe this on?)
I tried reproducing this using a test program specifically written to test this part of the code (`charm/tests/charm++/load_balancing/periodic_lb_broadcast_test`), and didn't see any issues.
Haven't tried on smaller runs yet. I get it to fail pretty much every run; however, the failure mode varies from run to run. Also, it takes around 24 iterations before the failure happens.
Just to be clear, this isn't happening in 7.0.0 or 6.10.2, correct? I assume `git bisect` led you to point to the broadcast delivery commit you mentioned as the cause?
You are correct: I used `git bisect` starting with 7.0.0 to find this commit.
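For anyone repeating this, the bisect looks roughly like the following; the ref names are illustrative (assumptions, not taken from the report), and each step requires rebuilding charm and ChaNGa and re-running the failing case.

```sh
git bisect start
git bisect good v7.0.0       # crash absent in the 7.0.0 release
git bisect bad origin/main   # crash present on the development branch (ref name assumed)
# build charm + ChaNGa at the checked-out commit, run the failing case, then mark it:
#   git bisect good    or    git bisect bad
# until git reports 5e26e07023 as the first bad commit
git bisect reset
```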
Could you also provide the machine and machine layer you used for this? Our efforts to reproduce it so far haven't come up with anything.
The machine is the NASA Pleiades machine (https://www.nas.nasa.gov/hecc/resources/pleiades.html). We use the verbs-linux-x86_64 smp machine layer.
Okay, thanks for the information. I don't think PPL has access to Pleiades, but we can at least try to use the same machine layer to reproduce what you're seeing.
@NK-Nikunj Can you try running a similar configuration with our broadcast test (`charm/tests/charm++/load_balancing/periodic_lb_broadcast_test`)?
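For reference, a launch mirroring the reported 128x20 layout might look something like the line below; this is illustrative only, and the PE counts plus any arguments the test itself takes would need adjusting.

```sh
# Hypothetical SMP launch: 2560 worker PEs split as 128 processes x 20 PEs each;
# the test may take additional arguments not shown here.
./charmrun +p2560 ++ppn 20 ./periodic_lb_broadcast_test
```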
Sure, working on it.
I do not have access to a machine with verbs-linux, but I could not reproduce the error on the machines I did try. @trquinn is this error specific to verbs?
Here's what I've tried so far:
- Build and run `periodic_lb_broadcast_test` on 1 node to 64 nodes on Cori without experiencing any hangs.
- Build a fresh installation of ChaNGa with the same commit specified in the issue and run `test_pg.param` under the teststep directory.
Is it possible to try the UCX layer on the Pleiades machine? That has been a way around issues with the verbs layer in the past.
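For reference, switching machine layers would just mean rebuilding charm against the UCX target, along the lines of the command below; this is only a sketch, and the exact options depend on the site's compilers and UCX installation.

```sh
# Hypothetical build line for the UCX SMP layer; options are placeholders.
./build ChaNGa ucx-linux-x86_64 smp --with-production -j16
```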
UCX has its own problems with ChaNGa (running out of registered memory segments).
I should also note: the reason we are using verbs on this machine is that the MPI implementation is UCX-based and therefore fails.
I've reproduced problems with this commit on Frontera using the mpi-linux-x86_64 smp build. I can point people to the failing ICs etc. on request.
It consistently fails the assertion at line 997 of ckarray.h: `CkAssert(storage[(headIndex + offset) & mask] == nullptr);`
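For context, that assertion appears to guard insertion into a power-of-two-sized circular buffer: the slot a message is about to occupy must still be empty. A minimal standalone sketch of the checked invariant (simplified and illustrative, not the actual Charm++ data structure):

```cpp
#include <cassert>
#include <cstddef>

// Simplified circular buffer indexed by delivery order. Capacity is a power
// of two so "(headIndex + offset) & mask" wraps around cheaply. The assert
// mirrors the CkAssert above: the target slot must be free; if it is not,
// two messages were mapped to the same slot, i.e. the ordering bookkeeping
// or the buffer's size accounting has gone wrong.
struct OrderedBuffer {
  static const std::size_t capacity = 64;   // must be a power of two
  static const std::size_t mask = capacity - 1;
  void *storage[capacity] = {};
  std::size_t headIndex = 0;                // index of the next expected message

  void put(std::size_t offset, void *msg) {
    assert(storage[(headIndex + offset) & mask] == nullptr);  // slot must be empty
    storage[(headIndex + offset) & mask] = msg;
  }
};
```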
This has been fixed with PR #3681.