rr icon indicating copy to clipboard operation
rr copied to clipboard

Nested signal handlers: assertion failure

Open mlechu opened this issue 11 months ago • 5 comments

While debugging a separate issue, we ran into an assertion failure

hello[FATAL src/RecordTask.cc:1840:pop_event()] 
 (task 136189 (rec:136189) at time 254)
 -> Assertion `pending_events.back().type() == expected_type' failed to hold. 

Test case:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>


static void SEGV_handler(__attribute__((unused)) int sig,
                         __attribute__((unused)) siginfo_t* si,
                         __attribute__((unused)) void* context) {
    alarm(1);
    pause();
}

static void alarm_handler(__attribute__((unused)) int sig,
                         __attribute__((unused)) siginfo_t* si,
                         __attribute__((unused)) void* context) {
    write(1, "hello", 5);
}

int main() {
    int size = 8 * 0x4000;
    void * stk_c = malloc(size);
    stack_t stk = {stk_c, 0, size};
    sigaltstack(&stk, 0);

    struct sigaction sa1, sa2;
    sa1.sa_sigaction = SEGV_handler;
    sa1.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa1.sa_mask);
    sigaction(SIGSEGV, &sa1, NULL);

    sa2.sa_sigaction = alarm_handler;
    sa2.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa2.sa_mask);
    sigaction(SIGALRM, &sa2, NULL);

    raise(SIGSEGV);
}

mlechu avatar Jan 22 '25 00:01 mlechu

This is a gnarly one!

Here's what happens under rr:

  • The SIGSEGV causes a switch to the sigaltstack.
  • The SIGSEGV handler enters pause, which we try to syscall-buffer. That enters the syscallbuf, switching to the syscallbuf's alt stack.
  • The SIGALRM fires. The kernel determines (via on_sig_stack() in signal.c) that we aren't currently on the sigaltstack, and builds the new sigframe on the sigaltstack --- overwriting the previous SIGSEGV sigframe!
  • We return from the SIGALRM handler via sigreturn.
  • When we try to return from the SIGSEGV handler via sigreturn, it reuses the overwritten sigframe and chaos ensures.

rocallahan avatar Feb 06 '25 03:02 rocallahan

It is not obvious what is the best way to fix this.

Probably the simplest thing to do is forcibly set the SS_AUTODISARM flag on calls to sigaltstack(). That flag does exactly what we need here because it's designed to prevent exactly this bug. I guess there is a compatibility risk with SS_AUTODISARM, which probably centers around the case where code longjmp's out of a signal handler. But we already don't support that in rr when a signal interrupts a syscallbuf'ed syscall. So I think we should just do this.

rocallahan avatar Feb 06 '25 03:02 rocallahan

I believe https://github.com/rr-debugger/rr/pull/3322 would also address this by moving us out of the syscallbuf before signal delivery.

Keno avatar Feb 06 '25 04:02 Keno

Yes, it would. Thanks for reminding me --- I remembered discussing that approach, but I forgot you actually implemented it. I should take that over and get it landed.

rocallahan avatar Feb 06 '25 04:02 rocallahan

:+1: Please go ahead. It is also still deep on my todo list, but I don't think I can justify spending any time on it anytime soon.

Keno avatar Feb 06 '25 04:02 Keno