rr Nested signal handlers: assertion failure

While debugging a separate issue, we ran into an assertion failure

hello[FATAL src/RecordTask.cc:1840:pop_event()] 
 (task 136189 (rec:136189) at time 254)
 -> Assertion `pending_events.back().type() == expected_type' failed to hold.

Test case:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>


static void SEGV_handler(__attribute__((unused)) int sig,
                         __attribute__((unused)) siginfo_t* si,
                         __attribute__((unused)) void* context) {
    alarm(1);
    pause();
}

static void alarm_handler(__attribute__((unused)) int sig,
                         __attribute__((unused)) siginfo_t* si,
                         __attribute__((unused)) void* context) {
    write(1, "hello", 5);
}

int main() {
    int size = 8 * 0x4000;
    void * stk_c = malloc(size);
    stack_t stk = {stk_c, 0, size};
    sigaltstack(&stk, 0);

    struct sigaction sa1, sa2;
    sa1.sa_sigaction = SEGV_handler;
    sa1.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa1.sa_mask);
    sigaction(SIGSEGV, &sa1, NULL);

    sa2.sa_sigaction = alarm_handler;
    sa2.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa2.sa_mask);
    sigaction(SIGALRM, &sa2, NULL);

    raise(SIGSEGV);
}

Jan 22 '25 00:01 mlechu

This is a gnarly one!

Here's what happens under rr:

The SIGSEGV causes a switch to the sigaltstack.
The SIGSEGV handler enters pause, which we try to syscall-buffer. That enters the syscallbuf, switching to the syscallbuf's alt stack.
The SIGALRM fires. The kernel determines (via on_sig_stack() in signal.c) that we aren't currently on the sigaltstack, and builds the new sigframe on the sigaltstack --- overwriting the previous SIGSEGV sigframe!
We return from the SIGALRM handler via sigreturn.
When we try to return from the SIGSEGV handler via sigreturn, it reuses the overwritten sigframe and chaos ensures.

Feb 06 '25 03:02 rocallahan

It is not obvious what is the best way to fix this.

Probably the simplest thing to do is forcibly set the SS_AUTODISARM flag on calls to sigaltstack(). That flag does exactly what we need here because it's designed to prevent exactly this bug. I guess there is a compatibility risk with SS_AUTODISARM, which probably centers around the case where code longjmp's out of a signal handler. But we already don't support that in rr when a signal interrupts a syscallbuf'ed syscall. So I think we should just do this.

Feb 06 '25 03:02 rocallahan

I believe https://github.com/rr-debugger/rr/pull/3322 would also address this by moving us out of the syscallbuf before signal delivery.

Feb 06 '25 04:02 Keno

Yes, it would. Thanks for reminding me --- I remembered discussing that approach, but I forgot you actually implemented it. I should take that over and get it landed.

Feb 06 '25 04:02 rocallahan

:+1: Please go ahead. It is also still deep on my todo list, but I don't think I can justify spending any time on it anytime soon.

Feb 06 '25 04:02 Keno