Nested signal handlers: assertion failure
While debugging a separate issue, we ran into an assertion failure
hello[FATAL src/RecordTask.cc:1840:pop_event()]
(task 136189 (rec:136189) at time 254)
-> Assertion `pending_events.back().type() == expected_type' failed to hold.
Test case:
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
static void SEGV_handler(__attribute__((unused)) int sig,
__attribute__((unused)) siginfo_t* si,
__attribute__((unused)) void* context) {
alarm(1);
pause();
}
static void alarm_handler(__attribute__((unused)) int sig,
__attribute__((unused)) siginfo_t* si,
__attribute__((unused)) void* context) {
write(1, "hello", 5);
}
int main() {
int size = 8 * 0x4000;
void * stk_c = malloc(size);
stack_t stk = {stk_c, 0, size};
sigaltstack(&stk, 0);
struct sigaction sa1, sa2;
sa1.sa_sigaction = SEGV_handler;
sa1.sa_flags = SA_SIGINFO | SA_ONSTACK;
sigemptyset(&sa1.sa_mask);
sigaction(SIGSEGV, &sa1, NULL);
sa2.sa_sigaction = alarm_handler;
sa2.sa_flags = SA_SIGINFO | SA_ONSTACK;
sigemptyset(&sa2.sa_mask);
sigaction(SIGALRM, &sa2, NULL);
raise(SIGSEGV);
}
This is a gnarly one!
Here's what happens under rr:
- The SIGSEGV causes a switch to the sigaltstack.
- The SIGSEGV handler enters
pause, which we try to syscall-buffer. That enters the syscallbuf, switching to the syscallbuf's alt stack. - The SIGALRM fires. The kernel determines (via
on_sig_stack()insignal.c) that we aren't currently on the sigaltstack, and builds the new sigframe on the sigaltstack --- overwriting the previous SIGSEGV sigframe! - We return from the SIGALRM handler via sigreturn.
- When we try to return from the SIGSEGV handler via sigreturn, it reuses the overwritten sigframe and chaos ensures.
It is not obvious what is the best way to fix this.
Probably the simplest thing to do is forcibly set the SS_AUTODISARM flag on calls to sigaltstack(). That flag does exactly what we need here because it's designed to prevent exactly this bug. I guess there is a compatibility risk with SS_AUTODISARM, which probably centers around the case where code longjmp's out of a signal handler. But we already don't support that in rr when a signal interrupts a syscallbuf'ed syscall. So I think we should just do this.
I believe https://github.com/rr-debugger/rr/pull/3322 would also address this by moving us out of the syscallbuf before signal delivery.
Yes, it would. Thanks for reminding me --- I remembered discussing that approach, but I forgot you actually implemented it. I should take that over and get it landed.
:+1: Please go ahead. It is also still deep on my todo list, but I don't think I can justify spending any time on it anytime soon.