CI: Intermittent Failure in unexpected_exit_pid_ns
[FATAL /cache/build/exclusive-amdci3-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()]
(task 105315 (rec:103563) at time 222)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 184055 ticks, expected no more than 177242
Note that the test timed out instead of failing outright, because the emergency debugger hit the following assertion while dumping the register state:
[FATAL /cache/build/exclusive-amdci3-0/julialang/rr/src/Task.cc:1026:regs() errno: ENOTTY]
(task 105315 (rec:103563) at time 222)
-> Assertion `is_stopped' failed to hold.
Tail of trace dump:
It would be nice to make the emergency dump robust to this situation such that CI finishes rather than running into the timeout.
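One way to picture that robustness is to have the dump path check whether the task is stopped before touching its registers, and to print a placeholder otherwise. The following is only a sketch under assumptions: `FakeTask`, `try_read_ip` and `dump_task` are hypothetical stand-ins invented for illustration, not rr's actual `Task` API.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for a traced task; not rr's Task class.
struct FakeTask {
  int tid;
  bool is_stopped;  // whether the tracee is in a ptrace-stop
  uint64_t ip;      // instruction pointer, only meaningful when stopped
};

// Reads the instruction pointer only when the task is stopped.
// Returns false instead of asserting, so the caller can keep dumping.
bool try_read_ip(const FakeTask& t, uint64_t* out_ip) {
  if (!t.is_stopped) {
    return false;
  }
  *out_ip = t.ip;
  return true;
}

// Produces one line of emergency-dump output without ever aborting.
std::string dump_task(const FakeTask& t) {
  std::ostringstream out;
  out << "task " << t.tid << ": ";
  uint64_t ip = 0;
  if (try_read_ip(t, &ip)) {
    out << "ip=0x" << std::hex << ip;
  } else {
    out << "registers unavailable (task not stopped)";
  }
  return out.str();
}
```

With a guard like this, the dump would degrade to a message for the running task and the process could still exit with the original failure rather than hanging until the CI timeout.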
https://buildkite.com/julialang/rr/builds/364#13ad5c83-f984-48f4-b38a-81702307cb14 is another one. Looking into the test, it retires conditional branches as fast as it can, so perhaps this is actually just a real overshoot. AMD is known to have terribly long NMI latencies on occasion, so I'm not sure there's anything we can do. @rocallahan perhaps we need to start implementing a skid backoff for replay, where it optimistically assumes a small skid and, if it skids past the target, retries with a larger one?
In theory we could, but that sounds nasty for rr replay -a, because we'd have to either restart from the beginning or start taking checkpoints; currently we don't create any. It could also be quite fiddly to implement, because the checkpointing logic currently lives in ReplayTimeline, which is a layer above ReplaySession :-(.
(the logic to take periodic checkpoints, I mean.)
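To make the backoff idea concrete, here is a minimal sketch of the retry loop, assuming some way to get back to a checkpoint exists. Everything in it is hypothetical: `find_working_skid` and the `run_from_checkpoint` callback are illustrative names, and the callback stands in for "restore the checkpoint, program the counter interrupt `skid` ticks before the target, run, and report the tick count where execution actually stopped".

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Tries candidate skid sizes in ascending order and returns the first one
// for which the interrupt landed at or before the target tick count.
//
// run_from_checkpoint(skid) is an assumed callback: it restores state to a
// checkpoint, arms the interrupt at (target - skid) ticks, resumes, and
// returns the tick count observed when the interrupt was delivered.
std::optional<uint64_t> find_working_skid(
    uint64_t target, const std::vector<uint64_t>& ascending_skids,
    const std::function<uint64_t(uint64_t)>& run_from_checkpoint) {
  for (uint64_t skid : ascending_skids) {
    if (skid > target) {
      // Cannot arm an interrupt before tick 0; larger skids won't help.
      break;
    }
    uint64_t stopped_at = run_from_checkpoint(skid);
    if (stopped_at <= target) {
      return skid;  // no overshoot: the caller can proceed from here
    }
    // Overshot the target: fall through and retry with a larger skid.
  }
  return std::nullopt;  // every candidate overshot
}
```

Each failed attempt costs a full restore and re-run, which is why this only makes sense if checkpoints are available close to the target. For scale, the overshoot in the failure above was 184055 - 177242 = 6813 ticks.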
Sigh ... Do we know anybody at AMD who could shed light on the hardware latency? Perhaps there's another magic MSR we could twiddle to get tighter interrupts.
Alternatively we could implement a skid-override mechanism and apply it to this test. I'm not sure what the least-bad approach is.
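A skid override could be as simple as an optional value that replaces the built-in default when it parses cleanly. The sketch below assumes the override arrives as a string, for example from an environment variable; the function name `effective_skid` and the variable name `RR_SKID_OVERRIDE` mentioned afterwards are made up for illustration and are not existing rr options.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Returns the skid size to use: the override if it is a valid positive
// decimal integer, otherwise the default. override_str may be null.
uint64_t effective_skid(uint64_t default_skid, const char* override_str) {
  // Require a leading digit so that "", "-1" and " 5" are all rejected
  // (strtoull on its own would accept a sign or leading whitespace).
  if (!override_str ||
      !std::isdigit(static_cast<unsigned char>(override_str[0]))) {
    return default_skid;
  }
  errno = 0;
  char* end = nullptr;
  unsigned long long value = std::strtoull(override_str, &end, 10);
  // Reject out-of-range values, trailing garbage, and zero.
  if (errno != 0 || *end != '\0' || value == 0) {
    return default_skid;
  }
  return static_cast<uint64_t>(value);
}
```

A test harness could then set the hypothetical variable for just this test and call `effective_skid(default_skid, std::getenv("RR_SKID_OVERRIDE"))`, leaving every other test on the default.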
I'm not optimistic about AMD tackling the NMI issue.