adjusted_counting_period assertion failures on Fedora 5.19-rc8 on Ryzen 6800U
I am experiencing 27 rr test failures after merging support for the Ryzen 6000 series (see PR #3351).
The processor is specifically a 6800U. The kernel is a custom build, essentially 5.19-rc8. I see similar test failures on Fedora's 5.18.xx kernel series as well.
Here are the failures:
52:x86/chew_cpu_cpuid
152:exit_with_syscallbuf_signal
418:no_mask_timeslice
419:no_mask_timeslice-no-syscallbuf
456:x86/pkeys
457:x86/pkeys-no-syscallbuf
816:unexpected_exit_pid_ns
872:async_kill_with_syscallbuf2
1001:ignored_async_usr1-no-syscallbuf
1070:reverse_continue_breakpoint
1071:reverse_continue_breakpoint-no-syscallbuf
1088:rseq
1089:rseq-no-syscallbuf
1310:record_replay
1394:term_trace_cpu
1395:term_trace_cpu-no-syscallbuf
1467:x86/chew_cpu_cpuid-32-no-syscallbuf
1870:x86/pkeys-32
1871:x86/pkeys-32-no-syscallbuf
2466:overflow_branch_counter-32
2467:overflow_branch_counter-32-no-syscallbuf
2484:reverse_continue_breakpoint-32
2485:reverse_continue_breakpoint-32-no-syscallbuf
2488:reverse_continue_process_signal-32
2493:reverse_step_long-32-no-syscallbuf
2502:rseq-32
2503:rseq-32-no-syscallbuf
Interestingly, 26 of the failures are due to the puzzling assertion below. (Of course, the tick counts differ depending on the failing test.)
[FATAL src/PerfCounters.cc:810:read_ticks()]
(task 133844 (rec:133838) at time 131)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 508733 ticks, expected no more than 498183
This assertion has come up previously in various Zen-related issues.
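For anyone skimming, the invariant being checked is conceptually the following. This is just a minimal sketch built around the names in the message (`counting_period`, `adjusted_counting_period`, `interrupt_val`), not rr's actual `read_ticks()`; the split between period and skid here is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Minimal sketch of the invariant behind the assertion above; NOT rr's
// actual read_ticks(). It only illustrates the relationship between the
// requested counting period, the skid allowance, and the tick count
// observed when the interrupt is delivered.
struct TickCheck {
  uint64_t counting_period;  // ticks requested before the interrupt fires
  uint64_t skid_size;        // extra ticks tolerated due to PMI latency

  uint64_t adjusted_counting_period() const {
    return counting_period + skid_size;
  }

  // The counter may overshoot the requested period (skid), but never by
  // more than skid_size; a larger overshoot means the counter can no
  // longer be trusted.
  void check(uint64_t interrupt_val) const {
    assert(!counting_period ||
           interrupt_val <= adjusted_counting_period());
  }
};

int main() {
  // Numbers from the log: allowed maximum 498183, observed 508733.
  // The period/skid split below is assumed for illustration only.
  TickCheck c{488183, 10000};
  c.check(508733);  // fires, as in the report: 508733 > 498183
}
```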
Note that I also tried `sudo sysctl -w kernel.perf_cpu_time_max_percent=100`, though it didn't seem to make any difference.
> Interestingly, 26 of the failures are due to the puzzling assertion below.
Sigh, SMI interrupt latency appears to be getting worse on the newer Zen chips.
What is interesting is that these tests seem to fail reliably (as far as I can tell).
If SMI latency were the core issue, shouldn't we see different tests fail on different runs? Or do these tests somehow reliably trigger long SMI latencies?
With Alder Lake requiring a larger skid region (for example), I'm generally worried about the flakiness of perf counters on newer x86 processors.
Is aarch64 (M1 / Neoverse) reliable with its counters so far, for instance? Just wondering.
SMI? This is the PMI, no?
Sorry, I meant NMI. My understanding is that on AMD, PMIs take a long trip through the interrupt controller and generate NMIs before coming back to the core, while on Intel they're more closely coupled, but I didn't look too closely at the details.
> With Alder Lake requiring a larger skid region (for example), I'm generally worried about the flakiness of perf counters on newer x86 processors.
Higher PMI latencies are a performance and quality-of-implementation issue, but they're not a correctness issue. Worth noting that the amount I bumped the Alder Lake skid size by is 0.25% of Zen's skid size, so Intel's issue, while unfortunate, is not really a big deal.
> Interestingly, 26 of the failures are due to the puzzling assertion below.
>
> [FATAL src/PerfCounters.cc:810:read_ticks()] (task 133844 (rec:133838) at time 131) -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 508733 ticks, expected no more than 498183
An additional ten thousand ticks here would be doubling the (already enormous) skid size :(
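To make the arithmetic concrete (the ~10,000-tick current skid is an assumption inferred from the comment above; the real value lives in src/PerfCounters.cc):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Figures from the assertion message above.
  const uint64_t detected    = 508733;
  const uint64_t allowed_max = 498183;              // counting_period + current skid
  const uint64_t overshoot   = detected - allowed_max;  // 10550 ticks

  // Assuming the current skid is ~10000 ticks, absorbing this overshoot
  // would push it to ~20550, i.e. roughly double.
  const uint64_t assumed_skid = 10000;
  printf("overshoot = %llu ticks, required skid ~= %llu (vs ~%llu now)\n",
         (unsigned long long)overshoot,
         (unsigned long long)(assumed_skid + overshoot),
         (unsigned long long)assumed_skid);
}
```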
> An additional ten thousand ticks here would be doubling the (already enormous) skid size :(
Can this change be made processor-dependent, or gated behind a preprocessor flag?
Any change here would be keyed on the microarchitecture.
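For illustration, a per-microarchitecture key might look like the sketch below. The enum values and skid numbers are hypothetical, not rr's actual PMU configuration in src/PerfCounters.cc.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical sketch of keying the skid size on the detected
// microarchitecture, in the spirit of the PMU configuration in
// src/PerfCounters.cc. Enum values and numbers are illustrative only.
enum class CpuMicroarch { IntelAlderLake, AmdZen2, AmdZen3Plus };

struct SkidEntry {
  CpuMicroarch uarch;
  uint64_t skid_size;  // extra ticks tolerated past the counting period
};

constexpr SkidEntry kSkidTable[] = {
    {CpuMicroarch::IntelAlderLake, 1024},   // hypothetical value
    {CpuMicroarch::AmdZen2,        10000},  // hypothetical value
    {CpuMicroarch::AmdZen3Plus,    21000},  // hypothetical bump for Ryzen 6000
};

uint64_t skid_for(CpuMicroarch uarch) {
  for (const auto& e : kSkidTable) {
    if (e.uarch == uarch) {
      return e.skid_size;
    }
  }
  return 1024;  // conservative default for unknown microarchitectures
}

int main() {
  printf("Zen3+ skid: %llu ticks\n",
         (unsigned long long)skid_for(CpuMicroarch::AmdZen3Plus));
}
```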