Realtime kernel causes "Assertion `!counting_period || measure_val <= adjusted_counting_period' failed to hold"
I see this error while synchronizing on a condition variable, after calling sched_setscheduler() with SCHED_FIFO on a kernel with the RT-preempt patch applied. It reproduces on Linux 4.4.53-rt66 and 4.9.15-rt12, while failing to reproduce on Linux 4.13.0.
This seems like the same problem as https://github.com/mozilla/rr/issues/2056, but involving the RT patch. Could it be that the RT patch includes the change which caused that bug? It seems to have been fixed in mainline kernels, but where can I find more information on what exactly that fix entailed?
Here's a minimal test program. Compiling it as gcc -Wall -g3 counter_bug.c -o counter_bug -lpthread -lrt and running it as sudo rr record ./counter_bug immediately results in the assertion failure. Note that the number of detected ticks is roughly the same as the argument to WAIT().
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <stdbool.h>

pthread_mutex_t mut;
pthread_cond_t down_cond;
pthread_cond_t up_cond;
bool flag;
volatile int counter;

/* Busy-wait long enough to generate a predictable number of
   conditional branches, which rr counts as "ticks". */
#define WAIT(N) for (counter = N; counter > 0; --counter)

void *
up (void *_)
{
  pthread_mutex_lock (&mut);
  for (;;)
    {
      while (flag)
        pthread_cond_wait (&down_cond, &mut);
      flag = true;
      pthread_cond_signal (&up_cond);
      WAIT (50000);
    }
}

void *
down (void *_)
{
  pthread_mutex_lock (&mut);
  for (;;)
    {
      while (! flag)
        pthread_cond_wait (&up_cond, &mut);
      flag = false;
      pthread_cond_signal (&down_cond);
      WAIT (50000);
    }
}

int
main (int argc, char *argv[])
{
  pthread_mutexattr_t mattr;
  pthread_condattr_t cattr;
  pthread_mutexattr_init (&mattr);
  pthread_mutex_init (&mut, &mattr);
  pthread_condattr_init (&cattr);
  pthread_cond_init (&up_cond, &cattr);
  pthread_cond_init (&down_cond, &cattr);

  struct sched_param param;
  param.sched_priority = 10;
  if (sched_setscheduler (0, SCHED_FIFO, &param) == -1)
    {
      perror ("error setting scheduler");
      return 1;
    }

  pthread_t sender;
  if (pthread_create (&sender, NULL, up, NULL))
    {
      perror ("error creating sender thread");
      return 1;
    }
  down (NULL);
  return 2;
}
Could it be that the RT patch includes the change which caused that bug?
Unlikely but possible.
It seems to have been fixed in mainline kernels, but where can I find more information on what exactly that fix entailed?
The fix was basically to back out https://github.com/torvalds/linux/commit/cc1582c231ea.
It is possible that the RT patches somehow suppress the delivery of a perf-related interrupt at times ... but I don't know much about that stuff.
During rr recording, rr is responsible for scheduling threads and only allows one thread to run at a time, so your sched_setscheduler calls are effectively no-ops at best. It might make sense to make sched_setscheduler an IrregularEmulatedSyscall and stub it out in record_syscall.cc if that effectively mitigates your problem. Try that?
The fix was basically to back out torvalds/linux@cc1582c.
I see, then that's a different issue. I checked the patched source, and that code by Jin Yao is nowhere to be seen.
During rr recording, rr is responsible for scheduling threads and only allows one thread to run at a time, so your sched_setscheduler calls are effectively no-ops at best.
I was wondering about this -- why does rr have its own scheduler? Since rr pegs tracees to a single core, only one of them can run at a time. So couldn't you just let the kernel schedule the threads as it pleases, and record context switch events?
It might make sense to make sched_setscheduler an IrregularEmulatedSyscall and stub it out in record_syscall.cc if that effectively mitigates your problem.
Yes, that mitigates the problem indeed! We can't just stub out this syscall, though, because it's not a complete no-op. The sched_priority field of param is recorded somewhere in the kernel and played back to subsequent calls to sched_getparam. What's the right way of emulating this in rr?
why does rr have its own scheduler? Since rr pegs tracees to a single core, only one of them can run at a time. So couldn't you just let the kernel schedule the threads as it pleases, and record context switch events?
- Having rr do the scheduling means we can avoid context switches in certain difficult situations, simplifying rr.
- Having rr do the scheduling means we can implement "chaos mode" which is really powerful.
It might be possible to alter that decision but it would be a lot of work from here, for not much gain AFAICT.
The sched_priority field of param is recorded somewhere in the kernel and played back to subsequent calls to sched_getparam. What's the right way of emulating this in rr?
Override sched_getparam as well and store sched_priority in a field on RecordTask.
Is sched_priority related to the priorities set by setpriority()? That's confusing.
I'm also running into this assertion now on a zen 2 machine, but I'm not using an RT kernel but just the normal archlinux kernel:
$ uname -a
Linux milian-workstation 6.2.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000 x86_64 GNU/Linux
With the zen workaround in place, rr used to work fine, but I haven't used it in a while, so I'm afraid I can't tell whether this is due to a kernel update or some other update.
Also, I see this happening at replay time, not at record time. Should I create a separate ticket for this, or is it related? The assertion is clearly the same:
[FATAL src/PerfCounters.cc:816:read_ticks()]
(task 109844 (rec:109419) at time 124172)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 516212 ticks, expected no more than 514315
Tail of trace dump:
{
real_time:20145.194232 global_time:124152, event:`SYSCALLBUF_RESET' tid:109419, ticks:5964913440
}
{
real_time:20145.194238 global_time:124153, event:`SYSCALL: clock_nanosleep' (state:EXITING_SYSCALL) tid:109405, ticks:890966062
rax:0x0 rbx:0xa rcx:0xffffffffffffffff rdx:0x7fffc7235a70 rsi:0x0 rdi:0x0 rbp:0x7fffc7235a90 rsp:0x7fffc7235a30 r8:0x0 r9:0x431bde83 r10:0x0 r11:0x246 r12:0x1 r13:0x7fbc02f00000 r14:0x802 r15:0x1 rip:0x7fbc142e5db5 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe6 fs_base:0x7fbc35650240 gs_base:0x0
}
{
real_time:20145.194272 global_time:124154, event:`SYSCALLBUF_FLUSH' tid:109405, ticks:890967842
{ syscall:'clock_gettime', ret:0x0, size:0x20 }
{ syscall:'clock_gettime', ret:0x0, size:0x20 }
...
This is a generic assertion that just indicates that replay failed somehow. It's probably unrelated to the original issue.
i.e., create a separate ticket and we can discuss it further there.