sentry-native icon indicating copy to clipboard operation
sentry-native copied to clipboard

Alternative backends on Windows don't catch crashes caused by `abort()`

Open BenjiHansell opened this issue 3 years ago • 16 comments

Using Breakpad or inproc on Windows results in most crashes not being captured.

When does the problem happen

  • [ ] During build
  • [ ] During run-time
  • [x] When capturing a hard crash

Environment

  • OS: Only Windows (in my case, Windows 10, 64-bit in VirtualBox)
  • Compiler: MSVC 19
  • CMake version and config: 3.21.2 -D SENTRY_BACKEND=breakpad -D BUILD_SHARED_LIBS=OFF

Steps To Reproduce

Just a simple boilerplate setup does it. Using abort() to cause the crash.

int main() {
  std::cout << "hello world" << std::endl;

  sentry_options_t* options = sentry_options_new();
  sentry_options_set_dsn(options, 'dsn redacted');
  sentry_options_set_debug(options, 1);

  sentry_init(options);

  std::this_thread::sleep_for(std::chrono::milliseconds(2000));

  abort();

  sentry_close();
}

Log output

This is the logs when it doesn't work. As you can see, no crash is captured.

hello world
[sentry] INFO using database path "C:\Users\vagrant\NonRepos\SentryNativeTutorial\.sentry-native"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
The terminal process "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -Command ./install/bin/tutorial.exe" terminated with exit code: 1.

However, if I swap out abort(); for

  volatile char* invalid_memory = nullptr;
  *invalid_memory = '\0';  // crash

to invoke a different kind of crash, I get

hello world
[sentry] INFO using database path "C:\Users\vagrant\NonRepos\SentryNativeTutorial\.sentry-native"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
[sentry] INFO entering breakpad minidump callback
[sentry] DEBUG merging scope into event
[sentry] DEBUG adding attachments to envelope
[sentry] DEBUG sending envelope
[sentry] DEBUG serializing envelope into buffer
[sentry] INFO crash has been captured
The terminal process "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -Command ./install/bin/tutorial.exe" terminated with exit code: 1.

That is the first run, and on subsequent runs I see the envelope is uploaded and appears in the Sentry UI.

Similarly, if I change over to use crashpad as the backend, things work fine (but crashpad is not an option for me on this project). Things also work fine on Linux and Mac with the breakpad backend.

In my project, there's a lot of assertions so most crashes are caused in the way that Windows can't handle them.

For now, my hacky workaround is to register my own signal handler like so:

[[noreturn]] void windows_signal_handler(int signal) {
  volatile char* invalid_memory = nullptr;
  *invalid_memory = '\0';  // raising SEH exception EXCEPTION_ACCESS_VIOLATION_WRITE
  // unreachable
}
std::signal(SIGABRT, windows_signal_handler);

This results in the full stack trace appearing on the UI, albeit with an extra frame or two.

BenjiHansell avatar Sep 10 '21 08:09 BenjiHansell

Thanks for sharing this observation. When compared with Breakpad, is using the inproc backend an option for you with its lack of minidumps? I'm assuming that you can't use Crashpad because you need a single executable or even process.

We'll make sure to add this to the inproc backend and expand our test matrix. While we say that the inproc backend technically supports Windows, it's not on par with Crashpad. We can further clarify this in the documentation. (cc @Swatinem)

Since we're planning to remove the Breakpad backend eventually, we're not going to add this there. Of course, we

jan-auer avatar Sep 21 '21 07:09 jan-auer

Thanks for sharing this observation. When compared with Breakpad, is using the inproc backend an option for you with its lack of minidumps? I'm assuming that you can't use Crashpad because you need a single executable or even process.

We'll make sure to add this to the inproc backend and expand our test matrix. While we say that the inproc backend technically supports Windows, it's not on par with Crashpad. We can further clarify this in the documentation. (cc @Swatinem)

Since we're planning to remove the Breakpad backend eventually, we're not going to add this there. Of course, we

Thanks for this clarification.

I don't think inproc is an option for us. We've gone with Breakpad due to the lack of minidumps from inproc. I hadn't realised the Breakpad backend was going to be removed though.

Is there perhaps any plan to add minidump functionality to the inproc backend before deprecating Breakpad?

P.S. The reason we can't use Crashpad is actually more nuanced. Yes the extra binary is inconvenient, but we can (and previously did) work with that. The bigger problem is that we are now using a custom transport, and Crashpad doesn't send crashes through Sentry Native's transport, instead it makes its own HTTP connection in the Crashpad process. HTTP isn't a reliable option in our high-security environment.

BenjiHansell avatar Sep 21 '21 10:09 BenjiHansell

I don't think inproc is an option for us. We've gone with Breakpad due to the lack of minidumps from inproc.

Thanks for clarifying. I think we'll still try to capture abort() from the inproc backend since its a very reasonable fix, but then this clearly won't resolve your issue.

I hadn't realised the Breakpad backend was going to be removed though.

We haven't started advertising this yet since it's not on any immediate roadmap. However, it's our understanding that the Breakpad client is in maintenance mode even at Google, and it adds considerable overhead maintaining both. For now, you can rely on the Breakpad backend for a little. That said:

The bigger problem is that we are now using a custom transport, and Crashpad doesn't send crashes through Sentry Native's transport, instead it makes its own HTTP connection in the Crashpad process. HTTP isn't a reliable option in our high-security environment.

This is interesting insight, thanks for bringing this up. While this now strays from the original topic, I think there might be alternate solutions. Just so I understand, the issue with Crashpad's transport is purely that it uses HTTP instead of HTTPS? Phrased differently, if we fix that, would you be able to go with Crashpad?

jan-auer avatar Sep 21 '21 10:09 jan-auer

This is interesting insight, thanks for bringing this up. While this now strays from the original topic, I think there might be alternate solutions. Just so I understand, the issue with Crashpad's transport is purely that it uses HTTP instead of HTTPS? Phrased differently, if we fix that, would you be able to go with Crashpad?

Actually it's not a HTTPS thing. To be more specific, we're using gRPC for our custom transport because that's pretty much the only protocol that is permitted to leave the secure area of our network. So, we have a custom transport which communicates with a gRPC->HTTP proxy which passes envelopes to the Sentry cloud.

BenjiHansell avatar Sep 21 '21 10:09 BenjiHansell

Makes sense, thanks. Let me circle back with @Swatinem which of these options would make most sense, and it could be multiple:

  • At very least, document limitations of secondary backends.
  • Patch a signal handler for std::abort into Breakpad's win client in a minimal effort, double-checking past work. It's actually surprising to me that it's not there yet, but we'll have to take a deeper look.
  • Same for the in-proc backend
  • Add a way to provide a transport function to crashpad_handler. This comes with the more general challenge of customizing the handler process which we haven't tackled much, yet. I think the biggest challenge is around build tooling and a neat API for this.

Until then, you might have another workaround on your end. The SDK always vendors the getsentry branch of our Crashpad fork. The Windows HTTP transport, particularly ExecuteSynchronously is defined here. You could either provide your own implementation of the HttpTransport base class and swap that in, or patch out the Windows implementation to run gRPC internally. The file's history seems fairly stable, which might make a custom patch feasible.

jan-auer avatar Sep 21 '21 11:09 jan-auer

Makes sense, thanks. Let me circle back with @Swatinem which of these options would make most sense, and it could be multiple:

  • At very least, document limitations of secondary backends.
  • Patch a signal handler for std::abort into Breakpad's win client in a minimal effort, double-checking past work. It's actually surprising to me that it's not there yet, but we'll have to take a deeper look.
  • Same for the in-proc backend
  • Add a way to provide a transport function to crashpad_handler. This comes with the more general challenge of customizing the handler process which we haven't tackled much, yet. I think the biggest challenge is around build tooling and a neat API for this.

Until then, you might have another workaround on your end. The SDK always vendors the getsentry branch of our Crashpad fork. The Windows HTTP transport, particularly ExecuteSynchronously is defined here. You could either provide your own implementation of the HttpTransport base class and swap that in, or patch out the Windows implementation to run gRPC internally. The file's history seems fairly stable, which might make a custom patch feasible.

Many thanks. For now we'll stick with Breakpad as we've got that implemented and working for the time being. We'll bear the Crashpad patch solution in mind for the future though.

BenjiHansell avatar Sep 21 '21 12:09 BenjiHansell

Just wanted to chime in that we are working through an interesting integration (using Swift on Windows) and trying to port some bindings from the native client into a small Swift library and encountered the same exact issue. Is the recommended work around to just install a signal handler in client code and trigger a different exception?

We could use breakpad I suppose, but it sounds like that support is going away?

brianmichel avatar Sep 29 '23 19:09 brianmichel

Breakpad support isn’t going away anytime soon.

Swatinem avatar Oct 02 '23 08:10 Swatinem

Just wanted to chime in that we are working through an interesting integration (using Swift on Windows) and trying to port some bindings from the native client into a small Swift library and encountered the same exact issue. Is the recommended work around to just install a signal handler in client code and trigger a different exception?

Which Native SDK backend do you currently use? In general, on Windows abort() is implemented as a __fastfail exit, which means it bypasses the SEH error handling (the main mechanism used by all of our backends to capture unexpected program termination). It still raises a synchronous SIGABRT signal before it exits, but only crashpad installs a SIGABRT handler on Windows. So, if crashpad is an option for you, it would immediately provide that capability.

For the other backends, this would have to be implemented, but it is unclear if this is going to be prioritized any time soon.

If you own the code that calls abort(), on Windows, an alternative would be to call RaiseException(). If you don't own the code that calls abort(), indeed a workaround would be to install a Windows-specific signal-handler for SIGABRT and call RaiseException() from there. Be aware that on Linux this is neither necessary nor recommended because the Native SDK installs its own signal handler for all abnormal program terminations.

We could use breakpad I suppose, but it sounds like that support is going away?

As @Swatinem mentioned above, this perspective has changed over time. There are a couple of use cases where the deployment scenario for the breakpad backend is preferred. But for your particular issue, I wonder whether using breakpad will help because, as described by the original issue description, breakpad currently also doesn't implement a SIGABRT handler on Windows.

supervacuus avatar Oct 02 '23 08:10 supervacuus

@supervacuus actually it seems like the crashpad handler is working when I call abort directly, but when I end up calling fatalError I see this last message and then the app just hangs forever and doesn't terminate:

“handing off to crash pad handler”

I want to do some debugging with the same sentry-native SDK on macOS with Swift to see if I encounter the same behavior so I can start isolating a test case.

Good call out with the RaiseException work around I might play around with that as well. I understand we're in an interesting spot with using Swift on Windows so some of the exception handling is kind of confusing, but will report back with some more info!

I think the root issue is whatever fatalError is doing under the hood which I need to investigate since we're clearly seeing things work correctly with abort()

brianmichel avatar Oct 02 '23 17:10 brianmichel

Hi again, @brianmichel.

actually it seems like the crashpad handler is working when I call abort directly, but when I end up calling fatalError I see this last message and then the app just hangs forever and doesn't terminate:

“handing off to crash pad handler”

You are now the second user who mentioned this behavior (being stuck after our SDK returned to the crashpad handler). By accident, I could reproduce this behavior for the first time on my local Windows machine as well and in a stable/repeatable manner.

Can you tell me which Windows and Visual Studio version you use for development (I use Windows 11, build: 22621.2361, Visual Studio 17 2022 targeting SDK version 10.0.22621)?

I could imagine this is unrelated to the execution paths below fatalError but rather a race between the crashpad client (resident in your application) and the crashpad_handler. But I am very interested in whatever you find out.

I want to do some debugging with the same sentry-native SDK on macOS with Swift to see if I encounter the same behavior so I can start isolating a test case.

You will observe very different behavior on macOS because the crash-handling mechanism is entirely different (not only technologically but also in how the crashpad client and the crashpad handler cooperate in producing a dump). On any supported version of macOS an unhandled SIGABRT will generate an EXC_CRASH mach exception. We will send a minidump for every EXC_CRASH, but you can only recover that it was an abort() from the stack trace.

supervacuus avatar Oct 02 '23 17:10 supervacuus

I'm using Windows 11, build: 22621.2283 and we don't actually use VS for compiling our application, but we've been targeting 10.0.18362.0 for our application.

Good point on the handling on macOS, might not be completely worth it since the handling paths are so different. I can probably publish what I have as a sample for you to test against. you will need to setup a Swift on Windows environment (pretty easy), but doable.

brianmichel avatar Oct 03 '23 17:10 brianmichel

@supervacuus so internally the Swift for Windows implementation ends up calling into __ud2() and I've boiled down a sample C application that will replicate the same symptoms of the "handing off to crash pad handler" stall, the program looks like this:

#include "sentry.h"
#include <intrin.h>

int main() {
  sentry_options_t* options = sentry_options_new();
  sentry_options_set_dsn(options, "dsn-goes-here");
  sentry_options_set_symbolize_stacktraces(options, 1);
  sentry_options_set_environment(options, "development");
  sentry_options_set_database_path(options, ".sentry-native");
  sentry_options_set_release(options, "[email protected]");
  sentry_options_set_debug(options, 1);
  sentry_init(options);

  __ud2();

  sentry_close();
}

Internally we found that calling __ud2() means you end up in https://github.com/chromium/crashpad/blob/main/client/crashpad_client_win.cc#L190 however the ExceptionCode is actually STATUS_ILLEGAL_INSTRUCTION not STATUS_HEAP_CORRUPTION. If you patch this function to look like this:

LONG WINAPI HandleHeapCorruption(EXCEPTION_POINTERS* exception_pointers) {
  switch (exception_pointers->ExceptionRecord->ExceptionCode) {
    case STATUS_HEAP_CORRUPTION:
    case STATUS_ILLEGAL_INSTRUCTION:
      return UnhandledExceptionHandler(exception_pointers);
    default:
      return EXCEPTION_CONTINUE_SEARCH;
  }
}

You get a little further, but then a DCHECK fails generating the minidump it seems.

I'd love to see if you can replicate the issue, and if maybe there's a way to work around this, or if I need to file an upstream bug with Crashpad.

brianmichel avatar Oct 04 '23 23:10 brianmichel

Hi again, @brianmichel, thanks for the investigation. Given your description of the problem, we can take a step back (which is a good thing in this case).

You are indeed experiencing a failing DCHECK, which exits the crashpad_handler (if you run a Debug build), and this is also the cause of your hang: The crashpad client (the part of crashpad resident in your application) is waiting for a response from the crashpad_handler to know when it can shut down. The DCHECK exits the crashpad_handler before it finishes the minidump generation and, in turn, never signals the client to shut down.

The HandleHeapCorruption exception handler is a red herring in this scenario. It is a vectored exception handler that gets called before the UnhandledExceptionFilter the crashpad client installed so that all exceptions will pass through this handler; the crashpad developers only added the particular case to capture heap corruptions (which we can't catch with the UnhandledExceptionFilter, by design).

If you build with Release or RelWithDebInfo, you will see that your hang will go away, even without changing the HandleHeapCorruption handler.

Technical aside: I will also remove/adapt the DCHECK in the crashpad_handler in our fork. This way, Debug builds will no longer hang. I understand that crashpad developers added the check because they already tested the CET mask in the caller. Still, even if the CET mask is set, the CET registers coming from the crash context can hold no values (both MSRs set to 0, which is the initial configuration for CET_U in the XSAVE-managed state, from chapter 13.6 in the Intel Programmers manual, i.e., the shadow-stack is also disabled for this thread).

supervacuus avatar Oct 09 '23 12:10 supervacuus

Ah, this did in fact go away when I re-created a release build 🤦 looks like I'm back up and running! Apologies for the distraction @supervacuus and thanks for the help!

brianmichel avatar Oct 09 '23 22:10 brianmichel

Ah, this did in fact go away when I re-created a release build 🤦 looks like I'm back up and running! Apologies for the distraction @supervacuus and thanks for the help!

No worries at all. This is a weird behavior that should be fixed. Thanks for bringing it up!

supervacuus avatar Oct 10 '23 08:10 supervacuus