ponyc icon indicating copy to clipboard operation
ponyc copied to clipboard

aarch64 stress test failure

Open SeanTAllen opened this issue 2 years ago • 10 comments

Built /tmp/cirrus-ci-build/build/debug/ubench

pingers 320, report-interval 300, report-count 40, initial-pings 5

time,run-ns,rate 1648088830.068952406,30053483462,96531 1648088860.070719308,29998868490,286459 1648088890.068914510,29995219769,991805 1648088920.069129568,29997469622,591083 1648088950.068988428,29996974504,576265 1648088980.068966495,29997228512,84623 1648089010.069066410,29997611190,106979 1648089040.068884807,29997205579,161170 1648089070.013283787,29941674098,110141 1648089100.069067991,29997324972,210099 1648089130.068898235,29997105042,83943 1648089160.068969918,29997565270,564882 1648089190.012050229,29940106963,403916 1648089220.068880526,29997043076,309020 1648089250.068898009,29997396797,112065 1648089280.069089718,29997789983,86448 1648089310.068937973,29996845647,503272 1648089340.068877503,29997309309,85840 1648089370.069142432,29997614828,85174 1648089400.068932256,29997163304,86704 1648089430.068925524,29996890953,1033368 1648089460.009789239,29938222516,1124596 1648089490.068918314,29997147994,82986 1648089520.068972428,29997381145,595015 1648089550.068905896,29997531979,111330 1648089580.009251495,29937630046,104222 1648089610.068869934,29996569740,104745 1648089640.068911874,29997402775,375346 1648089670.069094892,29997748357,109152 1648089700.069080519,29997284007,83407 1648089730.068951966,29996993328,305781 1648089760.068997921,29997480243,158272 1648089790.068914863,29997164121,181506 1648089820.068868261,29997370216,301929 1648089850.006849373,29935259766,185583 1648089880.006257949,29934627661,131529 1648089910.006632616,29935171228,109953 1648089940.005656209,29934107987,85651 /tmp/cirrus-ci-build/src/libponyrt/mem/pool.c:842: ponyint_pool_free: Assertion index < POOL_COUNT failed.

Backtrace: ./ubench(ponyint_assert_fail+0xe0) [0x476008] ./ubench(ponyint_pool_free+0x4c) [0x474814] ./ubench(ponyint_actor_messageq_pop+0x5c) [0x469610] ./ubench(ponyint_actor_run+0x8c) [0x467618] ./ubench() [0x477a5c] ./ubench() [0x476eac] /lib/aarch64-linux-gnu/libpthread.so.0(+0x84fc) [0xffffa0e304fc] /lib/aarch64-linux-gnu/libc.so.6(+0xd467c) [0xffffa0c7167c] Aborted (core dumped) make: *** [Makefile:214: test-stress-debug] Error 134

Exit status: 2

SeanTAllen avatar Mar 24 '22 13:03 SeanTAllen

I discussed this with @sylvanc. He agrees with my assessment that this is a time of check/time of use where that has to mean that the actor was being executed more than 1 scheduler at a time.

We took a look at the ABA protection on the scheduler queues for aarch64. It's currently doing the same thing as x86 which should be correct (but would be slower than it has to be on aarch64). That probably isn't the issue.

One thing I raised that Sylvan said sounded reasonable to look at...

It is possible that there might be a TOCTOU issue "in the backpressure system". I need to see if it is possible for an actor to be being unmuted at the same time it receives a message. I don't remember this being a problematic area but it is one that I thought of that needs consideration.

I did a review a few months ago looking for any TOCTOU issues and couldn't find any but will look again.

SeanTAllen avatar Mar 27 '22 17:03 SeanTAllen

There's a TOCTOU in ponyint_sched_unmute_senders plus how we decide when sending whether to schedule an actor or not.

For example in pony_sendv we do:

  • A1: ponyint_actor_messageq_push IF que was empty then
  • A2: if not muted
  • A3: schedule actor

In ponyint_sched_unmute_senders we do:

  • B1: Unmute actor
  • B2: schedule actor

So it is possible to have this interleaving with a muted actor that is having a message sent to it:

  • A1
  • B1
  • A2
  • A3
  • B2

SeanTAllen avatar Mar 29 '22 00:03 SeanTAllen

Interestingly, I can't find a way to have B2 in ponyint_sched_unmute_senders that doesn't result in a bad interleaving.

However, I have an idea of how to do this so that B2 isn't in ponyint_sched_unmute_senders and we still don't deadlock by never scheduling an actor that had additional work after it was unmuted.

However, really only @dipinhora and @sylvanc are capable of evaluating the idea so, I'll need to review it with Sylvan as soon as I can. It verges on "clever" so I want someone to go over it with.

General idea: https://github.com/ponylang/ponyc/pull/4075

SeanTAllen avatar Mar 29 '22 01:03 SeanTAllen

I forgot about a protection that is built in so the above TOCTOU isn't valid. So I am back to where I was when I did the last review. It all looks good. But clearly isnt on aarch64. And important point here is that we did a 32bit Arm port but never did an explicit arm64 port so, there might be something wrong in a lower level that we are depending on.

SeanTAllen avatar Mar 29 '22 17:03 SeanTAllen

I spent time last night looking at what assumptions I was making about lower level things working correctly and then trying to verify them against Verona which has some similar implementations. This lead to a conversation with @mjp41 which led to him suggesting a fix based on some TSAN results he got on snmalloc.

I'll be giving those a try. It involves making the memory ordering a bit stricter (acq_real) in our atomic compare and exchanges.

SeanTAllen avatar Mar 30 '22 15:03 SeanTAllen

There's a deep exploration and discussion of this issue in VUG #12: https://vimeo.com/695067236

SeanTAllen avatar Apr 01 '22 23:04 SeanTAllen

here's a minimal TSAN report from @mjp41:

https://gist.github.com/mjp41/3747d988209ca1d0d18a5f09ef9d67c7 as referenced 3 days ago in https://github.com/ponylang/ponyc/issues/4069#issuecomment-1083313798

SeanTAllen avatar Apr 02 '22 11:04 SeanTAllen

Given the we can't be sure we fixed this problem with any given commit, I am going to remove the "triggers release" that this deserves. This is an important issue to get a fix out for, but when my PR to fix this comes, we won't have any tests proving absense, only 48+ hours of testing that resulted in no assertion on aarch64. The real test will be the nightly stress tests runs not failing at some point in the future. Because of that, I am going to remove "triggers release" from this issue and when the fix PR goes in, not do an early release.

SeanTAllen avatar Apr 02 '22 11:04 SeanTAllen

PR to close this https://github.com/ponylang/ponyc/pull/4083

SeanTAllen avatar Apr 02 '22 12:04 SeanTAllen

This still happens from time to time on CI.

ergl avatar May 31 '22 13:05 ergl

This appears to have been fixed by https://github.com/ponylang/ponyc/pull/4159

SeanTAllen avatar Oct 26 '22 21:10 SeanTAllen