rr CI: Intermittent Failure in unexpected_exit_pid_ns

[FATAL /cache/build/exclusive-amdci3-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()]
 (task 105315 (rec:103563) at time 222)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 184055 ticks, expected no more than 177242
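For scale, the overshoot reported in the assertion message can be worked out directly. This is a minimal sketch that only restates the asserted condition and the two numbers from the log; the function name and the placeholder period value are hypothetical, and the log does not show how `adjusted_counting_period` is derived from `counting_period`, so nothing is assumed about that.

```python
def ticks_within_budget(counting_period, interrupt_val, adjusted_counting_period):
    """Model of the asserted condition: a zero period passes trivially,
    otherwise the observed tick count must not exceed the adjusted period."""
    return counting_period == 0 or interrupt_val <= adjusted_counting_period


detected = 184055   # "Detected 184055 ticks"
allowed = 177242    # "expected no more than 177242"

overshoot = detected - allowed                 # ticks past the allowed maximum
overshoot_pct = 100.0 * overshoot / allowed    # relative overshoot in percent

# The real counting_period is not in the log; 1 is a placeholder for "non-zero".
assertion_holds = ticks_within_budget(1, detected, allowed)
```

With the values from the log, the counter ran 6813 ticks past the allowed maximum, a bit under 4% over.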

Note that the test timed out rather than failing, because the following assertion fired while the emergency debugger was dumping the register state:

[FATAL /cache/build/exclusive-amdci3-0/julialang/rr/src/Task.cc:1026:regs() errno: ENOTTY]
 (task 105315 (rec:103563) at time 222)
 -> Assertion `is_stopped' failed to hold.
Tail of trace dump:

It would be nice to make the emergency dump robust to this situation such that CI finishes rather than running into the timeout.
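One way to picture the requested robustness is a dump routine that checks whether the task is stopped before touching its registers and emits a placeholder otherwise. The sketch below is a toy model, not rr code: `FakeTask`, `emergency_dump`, and the output format are all hypothetical names made up for illustration.

```python
class FakeTask:
    """Hypothetical stand-in for a traced task; not rr's Task class."""

    def __init__(self, tid, is_stopped, regs=None):
        self.tid = tid
        self.is_stopped = is_stopped
        self._regs = regs or {}

    def regs(self):
        # Models the failing assertion: registers are only readable when stopped.
        if not self.is_stopped:
            raise RuntimeError("task is not stopped")
        return self._regs


def emergency_dump(task):
    """Best-effort dump: never read registers from a task that is not stopped."""
    lines = [f"task {task.tid}"]
    if task.is_stopped:
        lines.append(f"regs: {task.regs()}")
    else:
        lines.append("regs: <unavailable: task not stopped>")
    return lines
```

In this model the dump always completes, so the original fatal error would be reported and the run would end instead of hanging until the CI timeout.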

Apr 14 '22 05:04 Keno

https://buildkite.com/julialang/rr/builds/364#13ad5c83-f984-48f4-b38a-81702307cb14 is another one. Looking into the test, it retires conditional branches as fast as it can, so perhaps this is actually just a real overshoot. AMD is known to have terribly long NMI latencies on occasion, so I'm not sure there's anything we can do. @rocallahan perhaps we need to start implementing a skid backoff for replay, where it optimistically assumes a small skid and, if it skids past the target, retries with a larger one?
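The skid backoff idea above can be sketched as a retry loop. This is a toy model under stated assumptions, not rr's replay code: `replay_to_target`, `run_from_checkpoint`, `slow_interrupt`, the initial skid of 1000, the doubling policy, and the fixed 7000-tick latency are all hypothetical choices for illustration.

```python
def replay_to_target(target_ticks, run_from_checkpoint,
                     initial_skid=1000, max_skid=1 << 20):
    """Retry with a growing skid allowance until the interrupt lands in time.

    run_from_checkpoint(period) is a hypothetical callback that rewinds to a
    known-good state, programs the counter interrupt after `period` ticks, and
    returns the tick count at which the interrupt was actually delivered.
    """
    skid = initial_skid
    attempts = 0
    while skid <= max_skid:
        attempts += 1
        period = max(target_ticks - skid, 0)   # aim short of the target
        stopped_at = run_from_checkpoint(period)
        if stopped_at <= target_ticks:
            # Landed at or before the target; the remaining ticks would be
            # covered by slower, precise stepping (not modeled here).
            return stopped_at, skid, attempts
        skid *= 2                              # skidded past: back off and retry
    raise RuntimeError("interrupt latency exceeded max_skid")


def slow_interrupt(period, latency=7000):
    """Hypothetical hardware model: delivery is always `latency` ticks late."""
    return period + latency


result = replay_to_target(177242, slow_interrupt)
```

With these made-up numbers the loop overshoots three times (skid 1000, 2000, 4000) and succeeds on the fourth attempt with a skid of 8000. Each failed attempt needs a state to rewind to, which is the cost the model hides inside `run_from_checkpoint`.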

Apr 20 '22 22:04 Keno

In theory we could, but that sounds nasty for rr replay -a because we'd have to either restart from the beginning or start taking checkpoints --- currently we don't create any. It could also be quite fiddly to implement because the checkpointing logic currently lives in ReplayTimeline, which is a layer above ReplaySession :-(.
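The trade-off between restarting from the beginning and taking periodic checkpoints can be illustrated with a small model. It is only a sketch: the function names and the 50-event checkpoint interval are hypothetical, and event time 222 is taken from the log above purely as an example target.

```python
import bisect


def restart_point(checkpoint_times, target_time):
    """Latest checkpoint at or before target_time, or 0 (the start) if none.

    checkpoint_times must be sorted in ascending order.
    """
    i = bisect.bisect_right(checkpoint_times, target_time)
    return checkpoint_times[i - 1] if i > 0 else 0


def events_to_rerun(checkpoint_times, target_time):
    """How many events a retry has to replay again to get back to the target."""
    return target_time - restart_point(checkpoint_times, target_time)


periodic = list(range(50, 251, 50))        # hypothetical: a checkpoint every 50 events
with_checkpoints = events_to_rerun(periodic, 222)
without_checkpoints = events_to_rerun([], 222)
```

In this model a retry replays 22 events again with periodic checkpoints and all 222 without any, which is why a retry scheme is unattractive while no checkpoints are being created.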

Apr 21 '22 04:04 rocallahan

(the logic to take periodic checkpoints, I mean.)

Apr 21 '22 04:04 rocallahan

Sigh ... Do we know anybody at AMD who could shed light on the hardware latency? Perhaps there's another magic MSR we could twiddle to get tighter interrupts.

Apr 21 '22 04:04 Keno

Alternatively we could implement a skid-override mechanism and apply it to this test. I'm not sure what the least-bad approach is.

I'm not optimistic about AMD tackling the NMI issue.

Apr 21 '22 04:04 rocallahan
