sentry-native Document Windows Error Reporting (WER) integration

I came across the CRASHPAD_WER_ENABLED flag because I saw it mentioned in the release notes. From what I can gather it makes sense to enable it on Windows in order to gain support for Windows Error Reporting (WER), but I could not find any real documentation about what it is/does. There was a similar question asked over at https://github.com/getsentry/sentry-dotnet/issues/1148, which also mentions that you have to somehow sign up for WER, but it unfortunately also lacks an answer specifically about how WER support works.

The only mention I could find about WER was the following confusing paragraph from here. I’m not sure exactly what "fast-fail crashes" are, and from the paragraph it’s unclear to me whether they require WER support to be supported by Crashpad. It also seems to imply that WER support is always enabled (which I don’t think is the case).

> Limitations in Crashpad on Windows for Fast-fail Crashes
>
> The Crashpad backend on Windows supports fast-fail crashes, which bypass SEH (Structured Exception Handling) primarily for security reasons. sentry-native registers a WER (Windows Error Reporting) module, which signals the crashpad_handler to send a minidump when a fast-fail crash occurs. However, since this process bypasses SEH, the application local exception handler is no longer invoked, which also means that for these kinds of crashes, before_send and on_crash will not be invoked before sending the minidump and thus have no effect.

I’d appreciate if WER support and the CMake flag could be documented, including how WER support works and when it is used in relation to Crashpad handling.

I also wonder if it should be enabled by default.

Aug 21 '23 08:08 triplef

Hi @triplef, this will be a bit lengthy.

Context

The crashpad WER module is a feature of the upstream crashpad project. We integrated it into the Native SDK silently (i.e., without big announcements or documentation) and enabled it by default for target platforms that trivially support it.

Its use case is relatively narrow (although gaining importance). At the same time, we already see many users overburdened by the complexities of native crash reporting (many of which we cannot hide neatly beneath a friendly abstraction). This is also true for the documentation, where we must carefully balance the urge to cover topics at book length against letting our users find the most relevant information they need.

This is not an excuse not to provide documentation for CRASHPAD_WER_ENABLED but rather part of a "strategy" (if you allow that overloaded term) to defer documenting some edge-case scenarios until we have evidence that people need it vs dumping information on every little detail.

And, again, if your target platform allows it, you will have it enabled by default, so in many cases, you will get crash notifications for a new class of error without awareness or knowledge of the intricacies their capture necessitates.

What new class of errors?

Typically crash-reporting on Windows uses the SEH facilities to be informed of any crashes the application developer didn't handle. In particular, all crash-reporters will at least register an implementation of an UnhandledExceptionFilter. This is fine as long as the crashes go through SEH (which most do).

A particular class of errors introduced with Windows 8 completely bypasses SEH: Fast Fail Errors. The assumption behind these errors is that a corrupted stack not only prevents the SEH mechanism from processing the call stack correctly, but can also be abused by an attacker who relies on handlers being executed.

So bypassing SEH (which is what "failing fast," in this case, means) is a deliberate choice to prevent the abuse of the default error mechanism on Windows in applications that a malicious actor may have intentionally corrupted. You typically do not __fastfail in application code, but system/runtime libraries will. In short, relying solely on SEH mechanisms won't enable you to detect "fast fail" crashes programmatically.
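To make the difference concrete, here is a minimal C sketch (not taken from sentry-native or crashpad). It installs a top-level filter via SetUnhandledExceptionFilter and offers a function that terminates with __fastfail. The names install_demo_filter, demo_filter, and crash_with_fast_fail are made up for illustration, and the Windows-specific parts are guarded so the file also compiles elsewhere, where both functions do nothing useful.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <intrin.h>

/* Top-level SEH filter: Windows calls this for ordinary unhandled
 * exceptions (e.g. an access violation), but not for fast-fail. */
static LONG WINAPI demo_filter(EXCEPTION_POINTERS *info)
{
    fprintf(stderr, "filter saw exception 0x%08lx\n",
        (unsigned long)info->ExceptionRecord->ExceptionCode);
    return EXCEPTION_EXECUTE_HANDLER;
}
#endif

/* Installs the demo filter. Returns 1 on Windows, 0 elsewhere. */
int install_demo_filter(void)
{
#ifdef _WIN32
    SetUnhandledExceptionFilter(demo_filter);
    return 1;
#else
    return 0;
#endif
}

/* Terminates the process via fast-fail on Windows; no-op elsewhere. */
void crash_with_fast_fail(void)
{
#ifdef _WIN32
    __fastfail(FAST_FAIL_FATAL_APP_EXIT);
#endif
}
```

On Windows 8 or later, calling install_demo_filter() followed by crash_with_fast_fail() should end the process without the filter's message ever being printed, whereas an ordinary access violation would reach the filter.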

What does this have to do with WER?

Windows Error Reporting (WER for short) is a Windows mechanism to record application crashes, similar in intent to what Sentry provides. While a large part of WER is about the centralized collection and analysis of crash data, the client-side component registers all application crashes, including fast-fail crashes. For the purposes of this explanation, this is a minimal introduction to what it does and how to use it.

This local crash collector allows developers to register a so-called "runtime exception module", a DLL that exposes a callback for WER to call in case of a crash. One could outsource all Windows crash-reporting this way, something I think of regularly for the Native SDK, given the enormous issues we see our users have with UnhandledExceptionFilter overwrites (typically from code they do not control).
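Registering such a module for the current process is a single Win32 call. The sketch below wraps WerRegisterRuntimeExceptionModule from werapi.h; the wrapper name register_wer_module and its int return convention are assumptions for illustration, and this is not the code path crashpad or the Native SDK actually uses. Microsoft's documentation also describes a registry entry (RuntimeExceptionHelperModules) for the DLL, which this sketch does not handle.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#include <werapi.h>
#endif

/* Asks WER to load the DLL at dll_path if this process crashes.
 * context is passed back to the DLL's callbacks unchanged.
 * Returns 1 on success, 0 on failure or when not built for Windows. */
int register_wer_module(const wchar_t *dll_path, void *context)
{
#ifdef _WIN32
    HRESULT hr = WerRegisterRuntimeExceptionModule(dll_path, context);
    return SUCCEEDED(hr) ? 1 : 0;
#else
    (void)dll_path;
    (void)context;
    return 0;
#endif
}
```

A caller would pass the full path of the module DLL, for example a hypothetical L"C:\\app\\my_wer_module.dll", early during startup so that the registration is in place before any crash can occur.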

How does `crashpad` use that mechanism?

crashpad provides a module implementation that solely reacts to fast-fail errors (and ignores all others because the SEH handler already covers them). This implementation interacts with the crashpad client in the crashing process to produce a minidump that, in turn, gets sent via the crashpad_handler back to Sentry.
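The "solely reacts to fast-fail errors" part boils down to a filter on the exception code. Below is a minimal sketch of that decision, assuming the module claims the two fail-fast NTSTATUS codes listed; the exact list upstream crashpad uses may differ, and wer_module_should_claim is a hypothetical name, not a crashpad function.

```c
#include <stdbool.h>
#include <stdint.h>

/* NTSTATUS values from Microsoft's documentation, redefined here so the
 * sketch does not depend on Windows headers. */
#define DEMO_STATUS_ACCESS_VIOLATION 0xC0000005u
#define DEMO_STATUS_STACK_BUFFER_OVERRUN 0xC0000409u /* reported for __fastfail */
#define DEMO_STATUS_FAIL_FAST_EXCEPTION 0xC0000602u

/* Returns true if the WER module should handle this crash itself,
 * false if it should leave it to the regular SEH-based handler. */
bool wer_module_should_claim(uint32_t exception_code)
{
    switch (exception_code) {
    case DEMO_STATUS_STACK_BUFFER_OVERRUN:
    case DEMO_STATUS_FAIL_FAST_EXCEPTION:
        return true;
    default:
        /* Everything else already goes through the SEH handler. */
        return false;
    }
}
```

With this filter an access violation (0xC0000005) is ignored by the module, so it is not reported twice, while a fast-fail crash is claimed and turned into a minidump.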

We register the WER module if present during sentry_init().

It is crucial to remember that this is all there is to the WER interaction. crashpad solely uses it to collect error context for crashes that would otherwise go unnoticed. It is not, in any way, a full-featured WER integration.

Why do we call this out in "Known Issues"?

Because we enable the WER module by default on target platforms that support it, users might receive a crash event for which our backend did not invoke the callbacks (before_send, on_crash). This is because SEH does not invoke our crash handler for these crashes, and that handler is what would call them.

Aug 23 '23 09:08 supervacuus

Hi @supervacuus, thank you very much for the super detailed and helpful explanation! 🙏

The original reason why I started looking into WER was that we see some users reporting crashes that never show up in Sentry, and I was hoping that WER might help with that.

Since you helpfully mentioned UnhandledExceptionFilter overwrites I now came across the discussion in https://github.com/getsentry/sentry-native/issues/833, and users mentioning Qt (which we also use) overwriting the exception filter. I’ll try the workaround of overwriting the setter function with a noop, though in our case we also need to first install an exception handler for Objective-C exceptions (see https://github.com/gnustep/libobjc2/pull/220), so I need to see how that will all work together.

With this in mind using WER for all crashes definitely sounds very appealing to me. We don’t need before_send and on_crash, and would much prefer more reliable crash reporting if this was an option.

Lastly I wanted to mention that I was also confused about the use of CRASHPAD_WER_ENABLED in sentry-native’s CMakeLists.txt since it is never set there, but now I realized that it is of course set by Crashpad itself and propagated to the parent scope – that’s why WER support is indeed enabled by default as indicated by the "WER support enabled" CMake configure output.

Aug 23 '23 11:08 triplef

> Since you helpfully mentioned UnhandledExceptionFilter overwrites I now came across the discussion in #833, and users mentioning Qt (which we also use) overwriting the exception filter. I’ll try the workaround of overwriting the setter function with a noop, though in our case we also need to first install an exception handler for Objective-C exceptions (see gnustep/libobjc2#220), so I need to see how that will all work together.

I have not yet responded in that issue at the required length, but in short: I am not a big fan of replacing the setter with a noop. This should be an absolute last-resort solution that is only applied at the application level (never at the library or framework level). There are way too many sensible short-term usages of that handler, and we have already had issues from users who interact with libraries that overwrite the handler before they are able to initialize sentry. So adding this functionality to the Native SDK would only move the goalposts.

I tend to prefer writing FAQ-like documentation that helps users in such a situation identify the culprit or, if a specific library or framework is known, providing a concrete workaround. For instance, if you start your Qt application via appman (Application Manager), you can disable its crash handler via the AM_NO_CRASH_HANDLER environment variable.

> With this in mind using WER for all crashes definitely sounds very appealing to me. We don’t need before_send and on_crash, and would much prefer more reliable crash reporting if this was an option.

Please don't hold your breath on this one... this would mean a considerable effort, as it is essentially a new backend that only serves Windows.

> Lastly I wanted to mention that I was also confused about the use of CRASHPAD_WER_ENABLED in sentry-native’s CMakeLists.txt since it is never set there, but now I realized that it is of course set by Crashpad itself and propagated to the parent scope – that’s why WER support is indeed enabled by default as indicated by the "WER support enabled" CMake configure output.

Yes, that is how it works. I am also not 100% happy with it, but at the time it was the most sensible thing to do and I couldn't think of a much better approach at this point either.

Aug 25 '23 14:08 supervacuus
