commit 5e26e07023 causing crashes in ChaNGa

Open trquinn opened this issue 2 years ago • 7 comments

The "Order array broadcast delivery with send epoch" commit seems to be causing memory corruption problems in ChaNGa. One of the errors looks like this: `Processor 375 Exiting: Called CmiAbort ------------ Reason: nbor_msgs_count_ <= 0 so may be not set

[375] Stack Traceback:
[375:0] ChaNGa.smp 0x8e240b CmiAsyncBroadcastFn
[375:1] ChaNGa.smp 0x5b4987 TreePiece::recvBoundary(unsigned __int128, NborDir)
[375:2] ChaNGa.smp 0x62b41f CkIndex_TreePiece::_call_recvBoundary_marshall96(void*, void*)
[375:3] ChaNGa.smp 0x791067 CkDeliverMessageFree
[375:4] ChaNGa.smp 0x7b7e51 CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) `

trquinn avatar Sep 20 '22 03:09 trquinn

This was implemented in PR #3614.

(Just commenting this so there's a link to the PR.)

rbuch avatar Sep 20 '22 03:09 rbuch

Are there other errors that this is causing? Is there a good way of reproducing this issue?

rbuch avatar Sep 20 '22 03:09 rbuch

Reproduction: so far I've only seen this on a 128x20 core run with an 8 GB input dataset. I'll try on something smaller. Other symptoms include getting zeros for arguments in entry methods that are targets of broadcasts. One example is TreePiece::updateuDot(), whose second argument is an array of 32 doubles. At least one of the doubles will occasionally be zero when it should not be.
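
One way to narrow down a symptom like this is to add a cheap sanity check at the top of the affected entry method. The sketch below is a hypothetical standalone helper (`countExactZeros` is an illustrative name, not a ChaNGa or Charm++ function) that counts exactly-zero entries in a buffer of doubles. It assumes that zero is never a legitimate value for the data in question, which would need to be confirmed for `updateuDot()` before relying on it.

```cpp
#include <cstddef>

// Hypothetical debugging helper, not part of ChaNGa or Charm++.
// Counts entries that compare equal to zero (this includes -0.0).
std::size_t countExactZeros(const double* values, std::size_t n) {
  std::size_t zeros = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (values[i] == 0.0) {
      ++zeros;
    }
  }
  return zeros;
}
```

Logging the processor number and iteration whenever the count is nonzero could help show whether the corruption correlates with a particular broadcast or step.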

trquinn avatar Sep 20 '22 17:09 trquinn

Any more luck on reproducing this at a smaller scale? How frequently does this happen: every run? How long does it take to happen (in the first iteration, or deep into the execution), and is it consistent? (Also, what machine and machine layer did you observe this on?)

I tried reproducing this using a test program specifically written to test this part of the code (charm/tests/charm++/load_balancing/periodic_lb_broadcast_test), and didn't see any issues.

rbuch avatar Sep 27 '22 16:09 rbuch

Haven't tried on smaller runs yet. I get it to fail pretty much every run. However, the failure mode varies from run to run. Also, it takes around 24 iterations before the failure happens.

trquinn avatar Sep 27 '22 17:09 trquinn

Just to be clear, this isn't happening in 7.0.0 or 6.10.2, correct? I assume git bisect led you to point to the broadcast delivery commit you mentioned as the cause?

rbuch avatar Oct 11 '22 16:10 rbuch

You are correct: I used "git bisect" starting with 7.0.0 to find this commit.

trquinn avatar Oct 11 '22 19:10 trquinn

Could you also provide the machine and machine layer you used for this? Our efforts to reproduce it so far haven't come up with anything.

rbuch avatar Nov 01 '22 16:11 rbuch

The machine is the NASA Pleiades machine (https://www.nas.nasa.gov/hecc/resources/pleiades.html). We use the verbs-linux-x86_64 smp machine layer.

trquinn avatar Nov 01 '22 16:11 trquinn

> The machine is the NASA Pleiades machine (https://www.nas.nasa.gov/hecc/resources/pleiades.html). We use the verbs-linux-x86_64 smp machine layer.

Okay, thanks for the information. I don't think PPL has access to Pleiades, but we can at least try to use the same machine layer to reproduce what you're seeing.

@NK-Nikunj Can you try running a similar configuration with our broadcast test (charm/tests/charm++/load_balancing/periodic_lb_broadcast_test)?

rbuch avatar Nov 01 '22 17:11 rbuch

Sure, working on it.

NK-Nikunj avatar Nov 01 '22 17:11 NK-Nikunj

I do not have access to a machine with verbs-linux, and so far I could not reproduce the error. @trquinn, is this error specific to verbs?

Here's what I've tried so far:

  1. Built and ran periodic_lb_broadcast_test on 1 to 64 nodes on Cori without experiencing any hangs.
  2. Built a fresh installation of ChaNGa at the commit specified in the issue and ran test_pg.param under the teststep directory.

NK-Nikunj avatar Nov 22 '22 07:11 NK-Nikunj

Is it possible to try the UCX layer on the Pleiades machine? That has been a way around issues with the verbs layer in the past.

ericjbohm avatar Nov 22 '22 17:11 ericjbohm

UCX has its own problems with ChaNGa (running out of registered memory segments).

trquinn avatar Nov 22 '22 20:11 trquinn

I should also note: the reason we are using verbs on this machine is that the MPI implementation is UCX-based and therefore fails.

trquinn avatar Nov 23 '22 23:11 trquinn

I've reproduced problems with this commit on Frontera using the mpi-linux-x86_64 smp build. I can point people to the failing ICs etc. on request.

trquinn avatar Dec 15 '22 01:12 trquinn

It consistently fails the assertion at line 997 of ckarray.h: `CkAssert(storage[(headIndex + offset) & mask] == nullptr);`
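
To make the failing check concrete, here is a minimal, hypothetical sketch of the kind of structure that an expression like `storage[(headIndex + offset) & mask]` suggests: a power-of-two ring buffer that holds out-of-order messages until the next expected one arrives. The names `ReorderWindow`, `insert`, and `popReady` are illustrative assumptions and are not taken from ckarray.h; only the index expression mirrors the quoted assertion. Under these assumptions, the check trips when two messages map to the same slot, for example if the same sequence number is delivered twice or a message lies a full window ahead of the head.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reorder window; NOT the actual Charm++ implementation.
class ReorderWindow {
public:
  // capacityPow2 must be a nonzero power of two so that `& mask` wraps indices.
  explicit ReorderWindow(std::size_t capacityPow2)
      : storage(capacityPow2, nullptr), mask(capacityPow2 - 1) {
    assert(capacityPow2 != 0 && (capacityPow2 & mask) == 0);
  }

  // Store msg at `offset` slots past the head. Returns false when the slot is
  // already occupied, which is the condition the quoted assertion guards against.
  bool insert(std::size_t offset, const int* msg) {
    const int*& slot = storage[(headIndex + offset) & mask];
    if (slot != nullptr) {
      return false;
    }
    slot = msg;
    return true;
  }

  // Remove and return the next in-order message, or nullptr if it has not arrived.
  const int* popReady() {
    const int*& slot = storage[headIndex & mask];
    const int* msg = slot;
    if (msg != nullptr) {
      slot = nullptr;
      ++headIndex;
    }
    return msg;
  }

private:
  std::vector<const int*> storage;  // one slot per pending sequence number
  std::size_t mask;                 // capacity - 1
  std::size_t headIndex = 0;        // sequence number expected next
};
```

With a window of 4, inserting at offset 1 and then at offset 5 targets the same slot, so the second `insert` returns false; in a build that asserts instead of returning, that would be the abort point.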

trquinn avatar Dec 15 '22 21:12 trquinn

This has been fixed by PR #3681.

trquinn avatar Jan 17 '23 19:01 trquinn