Add test for backtracing after stack overflow.

Open losfair opened this issue 3 years ago • 13 comments

This PR adds a test that triggers https://github.com/rust-lang/backtrace-rs/issues/356 .

For me, currently the added test fails on macOS with SIGSEGV but not Linux:

     Running target/release/deps/stack_overflow-8c961da828d1b0c2
Before stack overflow
Backtrace begin
error: test failed, to rerun pass '--test stack-overflow'

Interestingly, if the line println!("Before stack overflow"); is removed then the test works on macOS - looks like there's some randomness here.

losfair avatar Jul 06 '20 17:07 losfair

CI fails at the same location on AArch64 Linux. Interesting.

My Rust version:

rustc 1.46.0-nightly (16957bd4d 2020-06-30)
binary: rustc
commit-hash: 16957bd4d3a5377263f76ed74c572aad8e4b7e59
commit-date: 2020-06-30
host: x86_64-apple-darwin
release: 1.46.0-nightly
LLVM version: 10.0

My macOS version is 10.15.5.

losfair avatar Jul 06 '20 17:07 losfair

The segfault may be caused by QEMU itself, so it may not be an actual issue. It looks like macOS passes on CI though?

alexcrichton avatar Jul 06 '20 21:07 alexcrichton

It seems that the segfault has something to do with --release . When testing it locally (and running only the stack-overflow test) the error occurs only when the --release flag is enabled.

A full cargo test --release fails on an earlier test, before it reaches stack-overflow.

losfair avatar Jul 07 '20 15:07 losfair

Ah so release mode tests are expected to fail because they're pretty sensitive to debuginfo. I haven't put a ton of effort into getting them to run on CI yet.

For the macos failure, thanks for the tip! I can reproduce, although bizarrely only on nightly and also only in release. I'm able to get a stack trace, however, by enabling core dumps. The faulting thread looks like:

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff67da43cb libunwind.dylib`libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame(unsigned int, unsigned long long, libunwind::LocalAddressSpace&, libunwind::Registers_x86_64&) + 107
    frame #1: 0x00007fff67da42d8 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step() + 104
    frame #2: 0x00007fff67da8132 libunwind.dylib`_Unwind_Backtrace + 65
    frame #3: 0x000000010630fed0 backtrace`backtrace::capture::Backtrace::create::hdc40bc63a6bd0a27 + 128
    frame #4: 0x000000010630fe45 backtrace`backtrace::capture::Backtrace::new_unresolved::he08fc6a3f7e860ab + 21
    frame #5: 0x000000010630f524 backtrace`backtrace::test::trap_handler::h04106c99381ffc66 + 84
    frame #6: 0x00007fff67d6a5fd libsystem_platform.dylib`_sigtramp + 29
    frame #7: 0x000000010630f1f5 backtrace`backtrace::test::f::hc7256be6ebca5f6b + 5
    frame #8: 0x000000010630f218 backtrace`backtrace::test::f::hc7256be6ebca5f6b + 40
    frame #9: 0x000000010630f218 backtrace`backtrace::test::f::hc7256be6ebca5f6b + 40

where notably the exception is coming from the unwinder. The faulting instruction is movq (%rsi), %rax so it doesn't look like a stack overflow. That being said I'm not really sure what this means.

Also I'm able to reproduce this with stable rustc but is this broken for beta/nightly for you? I can't seem to reproduce on beta/nightly

alexcrichton avatar Jul 07 '20 18:07 alexcrichton

Backtrace is not signal-safe and shouldn't be used from a signal handler. This is also true of println and assert macros.

tmiasko avatar Jul 08 '20 17:07 tmiasko

This library is not async signal safe, but it is safe for synchronous signals. In this case generating a backtrace from a segfault handler is intended to work.

alexcrichton avatar Jul 08 '20 17:07 alexcrichton

Also I'm able to reproduce this with stable rustc but is this broken for beta/nightly for you? I can't seem to reproduce on beta/nightly

I'm able to reproduce this with latest nightly.

losfair avatar Jul 08 '20 17:07 losfair

Whether a signal is generated synchronously or asynchronously doesn't change the fact that the signal handler can only use async-signal-safe functions.

Take for example one reason why this crate isn't safe to use from a signal handler: the use of memory allocation routines. If a signal is delivered while malloc is executing and holding an internal lock, and the signal handler then allocates memory and needs to acquire the same lock, a deadlock will occur.
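
The reentrancy hazard described above can be sketched without real signals. This is only an illustration under stated assumptions: ALLOCATOR_LOCK is a hypothetical stand-in for the allocator's internal lock, and the "handler" is an ordinary function called on the same thread while the lock is held. It uses try_lock so that the sketch reports the problem instead of hanging; a handler that called lock() at that point would block forever.

```rust
use std::sync::{Mutex, TryLockError};

// Hypothetical stand-in for a lock held internally by malloc.
static ALLOCATOR_LOCK: Mutex<()> = Mutex::new(());

// Stand-in for a signal handler that needs to allocate.
// Returns true if it could take the allocator lock.
fn handler_can_take_lock() -> bool {
    match ALLOCATOR_LOCK.try_lock() {
        Ok(_guard) => true,
        // The lock is already held; a blocking lock() here would never return.
        Err(TryLockError::WouldBlock) => false,
        Err(TryLockError::Poisoned(_)) => true,
    }
}

// Simulates a signal arriving while the interrupted code is inside malloc.
fn interrupted_inside_malloc() -> bool {
    let _held = ALLOCATOR_LOCK.lock().unwrap();
    handler_can_take_lock()
}

fn main() {
    println!("handler outside malloc got the lock: {}", handler_can_take_lock());
    println!("handler inside malloc got the lock: {}", interrupted_inside_malloc());
}
```

When the interrupted code is known never to hold the lock (the isolated guest code case raised later in the thread), the second situation cannot arise.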

tmiasko avatar Jul 08 '20 17:07 tmiasko

@tmiasko

We are using backtrace in a JIT runtime and the SIGSEGV handler is for catching bad memory accesses from guest code, including stack overflows.

In that case, the code that produces SIGSEGV is completely isolated from the rest of the process, and there won't be reentrancy issues.

losfair avatar Jul 08 '20 17:07 losfair

Ok I did some more digging into this. I don't think this is an issue that can be fixed and I would recommend that you use a different scheme for catching stack overflow with your JIT code.

The segfault here is in the libunwind unwinder itself, and after researching a bit as to what's going on, it looks like the segfault is happening 16 bytes below the end of the stack. I believe the sequence of events can be reconstructed as:

  • Using libunwind we can get a handful of frames.
  • The segfault happens when we unwind the first frame of f
  • The frame f faulted in the middle of the function prologue
  • The unwind information for f is stored in a "compact format"
  • The compact format does not have a way to describe how to unwind in the middle of the prologue, instead it only defines how to unwind "during" the function
  • In interpreting the compact unwind information libunwind will hit a segfault again, trying to access memory the function itself faulted trying to push.

The issue here is that a stack overflow exception can happen anywhere in the prologue of a function, but generally unwind tables are not designed to describe unwinding from an arbitrary instruction in the prologue (there's the notion of "async unwind tables" on some systems for this). This means that the unwinder can't reliably unwind frames that are interrupted in the prologue.
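
The sequence above can be illustrated with a toy model. This is a sketch, not libunwind's code: Memory, Regs, call_and_prologue and step_rbp_frame are hypothetical names, the stack is a map of 8-byte words with everything below `limit` treated as an unmapped guard page, and the single unwind rule ("rbx at rbp-8, caller rbp at rbp, return address at rbp+8") is assumed to be valid only once the prologue has finished.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Fault(u64); // address of the access that hit the guard page

struct Memory {
    limit: u64, // addresses below this are unmapped
    words: HashMap<u64, u64>,
}

impl Memory {
    fn new(limit: u64) -> Self {
        Memory { limit, words: HashMap::new() }
    }
    fn read(&self, addr: u64) -> Result<u64, Fault> {
        if addr < self.limit {
            return Err(Fault(addr));
        }
        Ok(self.words.get(&addr).copied().unwrap_or(0))
    }
    fn write(&mut self, addr: u64, value: u64) -> Result<(), Fault> {
        if addr < self.limit {
            return Err(Fault(addr));
        }
        self.words.insert(addr, value);
        Ok(())
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Regs { rsp: u64, rbp: u64, rbx: u64, rip: u64 }

// A push only moves rsp if the store succeeds.
fn push(mem: &mut Memory, regs: &mut Regs, value: u64) -> Result<(), Fault> {
    let addr = regs.rsp - 8;
    mem.write(addr, value)?;
    regs.rsp = addr;
    Ok(())
}

// Models: call f; push rbp; mov rbp, rsp; push rbx
fn call_and_prologue(mem: &mut Memory, regs: &mut Regs, return_addr: u64) -> Result<(), Fault> {
    push(mem, regs, return_addr)?;
    let caller_rbp = regs.rbp;
    push(mem, regs, caller_rbp)?;
    regs.rbp = regs.rsp;
    let rbx = regs.rbx;
    push(mem, regs, rbx)?;
    Ok(())
}

// Applies the one rule the toy "compact" table has for this function.
fn step_rbp_frame(mem: &Memory, regs: Regs) -> Result<Regs, Fault> {
    let rbx = mem.read(regs.rbp - 8)?;
    let rbp = mem.read(regs.rbp)?;
    let rip = mem.read(regs.rbp + 8)?;
    Ok(Regs { rsp: regs.rbp + 16, rbp, rbx, rip })
}

// Enters the function with the given rsp, then unwinds one frame.
fn run(initial_rsp: u64) -> (Result<(), Fault>, Result<Regs, Fault>) {
    let mut mem = Memory::new(0x1000);
    let mut regs = Regs { rsp: initial_rsp, rbp: 0x2000, rbx: 7, rip: 0 };
    let entered = call_and_prologue(&mut mem, &mut regs, 0xAAAA);
    let unwound = step_rbp_frame(&mem, regs);
    (entered, unwound)
}

fn main() {
    // Enough stack left: the prologue completes and the rule works.
    println!("{:?}", run(0x1018));
    // One slot short: the final push faults, and the unwinder then
    // faults reading the very slot that push was trying to write.
    println!("{:?}", run(0x1010));
}
```

In the second run the unwinder fails at the same address as the interrupted push, which is the shape of the failure described in the list above.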

Overall I don't think there's a way to solve this in Rust (unless there's actually a table for async unwind tables on macOS). I would recommend a different strategy for catching stack overflow or otherwise trying to only recover from stack overflow in JIT code which can be tightly controlled and not recovering from native code.
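
One way to read "only recover from stack overflow in JIT code" is a check on the faulting program counter before attempting any recovery. The sketch below shows only that decision. FaultAction, classify_fault and the address range are hypothetical, and how a real handler obtains the faulting PC (from the signal context) is platform-specific and not shown.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum FaultAction {
    // Fault came from generated guest code, whose frames the runtime controls.
    RecoverGuest,
    // Fault came from native code; unwinding from here may itself fault.
    Abort,
}

// Decides what to do based on where the faulting instruction lives.
fn classify_fault(fault_pc: usize, jit_code: &Range<usize>) -> FaultAction {
    if jit_code.contains(&fault_pc) {
        FaultAction::RecoverGuest
    } else {
        FaultAction::Abort
    }
}

fn main() {
    // Hypothetical address range of the JIT code region.
    let jit_code = 0x7000_0000..0x7001_0000;
    println!("{:?}", classify_fault(0x7000_1234, &jit_code));
    println!("{:?}", classify_fault(0x1000_0000, &jit_code));
}
```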

alexcrichton avatar Jul 08 '20 21:07 alexcrichton

unless there's actually a table for async unwind tables on macOS

I'm not 100% sure what you're asking here but n.b. the __TEXT,__unwind_info section does include an index: https://opensource.apple.com/source/libunwind/libunwind-35.1/include/mach-o/compact_unwind_encoding.h.auto.html

luser avatar Jul 22 '20 17:07 luser

Oh what I mean is that to generate a backtrace from a function that segfaulted in its prologue, libunwind needs to know how to unwind from every single instruction in the function, not just the "body" after the prologue. AFAIK that's only supported with async unwind tables (and maybe full DWARF unwind tables?), and I'm not sure how to get LLVM to generate non-compact or async unwind tables.

alexcrichton avatar Jul 22 '20 17:07 alexcrichton

I'm not sure how to get LLVM to generate non-compact or async unwind tables.

LLVM emits non-compact unwind info for the linker to use, but Apple's linker then compacts the table (and there's no way to tell it to do otherwise).

eggyal avatar Oct 08 '21 17:10 eggyal

This PR seems to have served its purpose and hosted all the discussion it is going to have. I will not close the issue that spawned it, but that issue is unlikely to be resolved soon. Since this test isn't going to be accepted as a regression test anytime soon, I am going to close this PR.

workingjubilee avatar Jun 28 '23 22:06 workingjubilee