Introduce additional types of exceptions next to `mechanism.handled` exceptions
This RFC proposes introducing additional types of exceptions next to `mechanism.handled`.
Currently, exceptions which cause the running software to exit are marked as `handled: false`. This isn't sufficient for SDKs where an exception can be unhandled yet not cause the software to exit. The aim of this RFC is to fix that.
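For orientation, here is a minimal sketch of the `mechanism` object as it appears inside an event payload today, plus one hypothetical way an extra field could express "unhandled, but did not terminate the process". The extra field name is invented for illustration and is not part of this RFC's final proposal:

```ts
// Today: the mechanism carries a single boolean, and `handled: false` is
// treated downstream as "this session crashed".
const currentExceptionEntry = {
  type: "TypeError",
  value: "Cannot read properties of undefined",
  mechanism: {
    type: "onerror",   // how the SDK captured the exception
    handled: false,    // currently implies a crash
  },
};

// Hypothetical extension (field name invented for illustration): keep
// `handled` for backwards compatibility and add an explicit crash signal.
const proposedExceptionEntry = {
  type: "TypeError",
  value: "Cannot read properties of undefined",
  mechanism: {
    type: "onerror",
    handled: false,
    terminated_process: false, // unhandled, but the app kept running
  },
};
```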
Disclaimer: I'm no Sentry employee. However, this is an issue I would really love to see fixed, and I was directed here, to write up my ideas. I hope that's okay.
This RFC also looks a bit related to this PR: https://github.com/getsentry/relay/pull/306
Just for context: those were my notes about this feature back then. This RFC addresses all of them.

Relates to https://github.com/getsentry/rfcs/pull/15 as well.
LGTM. I like option 1 because of the backwards-compatibility.
Definitely applies for .NET and .NET MAUI as well.
@ueman Thank you for writing this up! It's been on my list for a long time, because it's also decidedly broken in JS, in a number of different ways.
On the browser side of things:
- True crashes (the kind that land you here: chrome://crash/) are pretty rare. That's fortunate, because at that point the JS engine has gone down in flames (taking Sentry along with it) so we have no good way to record what's happened. (There's no publicly-available native browser API to capture such an event the way you can with, say, a minidump.)
- Nonetheless, we mark some errors as unhandled, specifically those which bubble up to the global handlers (see the sketch after this list). While it's true they haven't been handled (unless the user has added a global `onerror` handler themselves, which is not something we currently try to detect), they're also not crashes. (They show up in release health metrics that way, though.)
- Errors which are caught by our auto-instrumentation (1, 2, 3, 4, 5) are marked as handled, even though they have not, in fact, been handled by the user.
- Events which are handled by the user are correctly marked as being handled.
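As a rough sketch of the browser behavior described above (assuming a heavily simplified SDK; `captureWithMechanism` is a hypothetical helper standing in for the SDK's internal event builder):

```ts
// Hypothetical helper that attaches mechanism metadata to a captured event;
// the real SDK builds this metadata internally when assembling the payload.
declare function captureWithMechanism(
  error: unknown,
  mechanism: { type: string; handled: boolean }
): void;

// Anything reaching these listeners is tagged `handled: false` today, even
// though the JS engine (and usually the page) keeps running afterwards.
window.addEventListener("error", (event) => {
  captureWithMechanism(event.error, { type: "onerror", handled: false });
});

window.addEventListener("unhandledrejection", (event) => {
  captureWithMechanism(event.reason, {
    type: "onunhandledrejection",
    handled: false,
  });
});
```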
So the data is wrong in many cases, but it's also kind of arbitrary and divorced from reality, in both directions: It's totally possible for some browser extension the user has installed to be blowing up the console with unhandled errors, causing all of the sessions in the main web app to be marked as crashed, even though the user is blissfully unaware that anything's wrong (and the erroring code has nothing to do with the app integrating Sentry). It's also possible for, say, a click handler to error, the site to therefore become unresponsive to at least some user interactions, and for our auto-instrumentation to catch it and mark it handled (and therefore a healthy session), even though the user's experience is of an app which has frozen up.
On the node side, we have both the same (too much handled) and the opposite (too much unhandled) problem:
- True crashes (anything which causes the main process to exit) do happen, but only sometimes trigger the `beforeExit` event. We also sometimes purposefully exit the process.
- Similar to the browser SDK, the node SDK marks unhandled errors and rejections which make it to the global handler as unhandled, but depending on `Sentry.init` settings, this may or may not result in the process being force-exited. Also similar to browser-land, all other errors are marked handled, whether they were caught by the user or by auto-instrumentation.
- If the node SDK is being run inside the serverless SDK, though, all errors are considered unhandled, regardless of whether or not the user catches the error.
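The Node-side hooks mentioned above, sketched with standard Node APIs and the same hypothetical capture helper (this is illustrative, not the SDK's actual implementation):

```ts
import * as process from "node:process";

// Hypothetical capture helper standing in for the SDK's internal event builder.
declare function captureWithMechanism(
  error: unknown,
  mechanism: { type: string; handled: boolean }
): void;

// Errors and rejections that reach the global handlers are tagged unhandled.
// Whether the process is then force-exited depends on SDK configuration.
process.on("uncaughtException", (err) => {
  captureWithMechanism(err, { type: "onuncaughtexception", handled: false });
});

process.on("unhandledRejection", (reason) => {
  captureWithMechanism(reason, { type: "onunhandledrejection", handled: false });
});

// `beforeExit` fires when the event loop drains, but not on process.exit()
// or fatal signals, which is why true crashes are only sometimes observed here.
process.on("beforeExit", () => {
  // last chance to flush queued events before a normal exit
});
```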
Ultimately, there are really three problems/challenges/considerations/whatever that we're dealing with:
- We'd like some way for the data model to distinguish between true crashes and unhandled errors (the main subject of this RFC).
- As has been alluded to in other comments, making changes here has consequences for both the UI (the red 💀 badge) and downstream data (session statuses and the resulting crash-free rate for a given release).
- We should be emitting values which better reflect reality (by itself not a hard change, but one which needs the above two issues to get figured out first).
As a result, anything we do is going to have to involve not only SDK folks but also product folks (together making a decision) and then design/UI folks, and possibly ingest folks as well (in order to fully implement any changes). We've got some good SDK representation here (though @sl0thentr0py, @antonpirker, or @cleptric, would love to get a backend SDK perspective), but not yet any opinions from product. @smeubank, do you know what product person is in charge of release health?
Relevant issues: https://github.com/getsentry/sentry-javascript/issues/5408 https://github.com/getsentry/sentry-javascript/issues/5375
Thanks for the detailed message, @lobsterkatie. @smeubank and Daniel Khan will discuss this in the next planning.
One general question: how can we send data to Sentry when the process terminates? Is this even possible? (Can we store the error to disk fast enough to send it on the next start?)
And @lobsterkatie, can this new thing catch real browser crashes: https://web.dev/reporting-api/ maybe?
One general question: how can we send data to Sentry when the process terminates? Is this even possible? (Can we store the error to disk fast enough to send it on the next start?)
yes
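One common approach, sketched here under the assumption that the SDK can write to local storage (this is not a description of any specific Sentry SDK): serialize the pending event to disk inside the fatal handler, then read and send it on the next start.

```ts
import * as fs from "node:fs";
import * as path from "node:path";

const PENDING_DIR = "/tmp/sentry-pending"; // illustrative location

// Called from a fatal-error handler: write synchronously, since the process
// is about to die and async I/O may never complete.
function persistEventToDisk(event: object): void {
  fs.mkdirSync(PENDING_DIR, { recursive: true });
  const file = path.join(PENDING_DIR, `${Date.now()}.json`);
  fs.writeFileSync(file, JSON.stringify(event));
}

// Called on the next start: send whatever was left behind, then clean up.
async function sendPendingEvents(send: (event: object) => Promise<void>) {
  if (!fs.existsSync(PENDING_DIR)) return;
  for (const name of fs.readdirSync(PENDING_DIR)) {
    const file = path.join(PENDING_DIR, name);
    const event = JSON.parse(fs.readFileSync(file, "utf8"));
    await send(event);
    fs.unlinkSync(file);
  }
}
```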
And @lobsterkatie can this new thing catch real browser crashes: web.dev/reporting-api maybe?
Not really. Or, rather, technically yes, but not in all browsers, and not with really any useful data. The problem is, it's not a hook like onerror is, it's just the ability to set a URL to receive a set data payload. (It's the same way CSP reports work.) And in the case of a crash, that data payload you get is pretty underwhelming:
[screenshot of the crash report payload]
(You can test it out by using this demo site.)
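For reference, a rough sketch of how the Reporting API delivers crash reports (based on current Chromium behavior; the payload fields shown are illustrative): the server declares an endpoint in a response header, and the browser later POSTs a small JSON report to it, with no JavaScript hook involved.

```ts
import * as http from "node:http";

// Toy server for illustration. The important parts are the response header
// (declaring where reports should go) and the shape of the report payload.
http
  .createServer((req, res) => {
    if (req.method === "POST" && req.url === "/reports") {
      let body = "";
      req.on("data", (chunk) => (body += chunk));
      req.on("end", () => {
        // The browser POSTs (possibly minutes after the crash) something like:
        // [{ "type": "crash", "age": 42, "url": "https://example.com/",
        //    "user_agent": "...", "body": { "reason": "oom" } }]
        // i.e. no stack trace and no breadcrumbs, just a reason code.
        console.log("received reports:", body);
        res.writeHead(204).end();
      });
      return;
    }
    // Crash reports are delivered to the "default" endpoint group.
    res.setHeader("Reporting-Endpoints", 'default="https://example.com/reports"');
    res.writeHead(200, { "Content-Type": "text/html" }).end("<html>...</html>");
  })
  .listen(3000);
```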
transferring questions from Eran Arkin
A few questions:
- What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?
- Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level?
- I’m not sure what’s the cost of piping this change to the UI (i.e., filters, search, alerts, etc.). So that’s something we will need to budget.
Points 1 & 3, I think, developers here can answer from experience working with and on Sentry.
Point 2: the % of the customer base would be all platforms where the boolean doesn't really apply. In theory they would all be impacted.
- What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?
This is answered in the RFC already, but I can elaborate more:
You'd be able to prioritize issues better (crashes > unhandled > handled) and ignore less important issues (cut noise).
Because the importance of issues would be explicit, you could triage quicker and alert only on the important ones, instead of being alerted for something not so important, or not being alerted at all.
- I’m not sure what’s the cost of piping this change to the UI (i.e., filters, search, alerts, etc.). So that’s something we will need to budget.
The changes in the SDKs should be very quick, but I can't estimate filters, search, alerts, etc.; that's not part of my scope.
transferring questions from Eran Arkin
A few questions:
- What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?
Yep, currently the problem is that our app basically never crashes (as in, the app quits; that's how the platform works), so there's no easy way to differentiate between exceptions which we caught ourselves vs. those that were unhandled but didn't crash (quit) the application. And those are the ones we're interested in fixing.
- Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level?
No clue, since I'm not working for Sentry but for a company which is a paying enterprise customer. A workaround is filtering based on the mechanism with which the exception was reported, but that requires that the SDK actually adds a mechanism in those cases, and it also requires a probably unrealistically good understanding of the SDK and UI. I just happen to have contributed a lot of code to Sentry SDKs, so I'm not a good example.
Since Sentry is just starting to push into areas where this makes a difference (the Flutter, Unity and React Native SDKs are relatively new compared to the pure Android and iOS SDKs), this is also a growth opportunity and not just an improvement opportunity.
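For concreteness, the workaround described above amounts to filtering issues on mechanism metadata in the search bar; a couple of illustrative queries are shown below as strings (the exact fields and values available depend on the SDK and what it populates):

```ts
// Illustrative only: issue-search filters the workaround relies on.
// Whether these return anything useful depends entirely on the SDK having
// attached a mechanism to the exception in the first place.
const searchQueries = [
  "error.handled:false",     // exceptions the mechanism marked as unhandled
  "error.mechanism:onerror", // filter on how the exception was captured
];
```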
- What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?
I'd echo everything @marandaneto and @ueman's said, and add that for JS, what the customer gains is data which isn't totally bogus, which feels like a good thing in and of itself. I/we would have fixed the data long ago if there weren't UI/product consequences, but even then it would have been only semi-accurate, because the reality is that there really is a distinction between crashes and unhandled errors. If we were to fix the booleans to be the correct booleans tomorrow, people's crash rates on their releases would skyrocket, because all sorts of things we're incorrectly marking as handled would get marked as unhandled and therefore count as crashing the app, even though that's not reflective of reality. (There's the added complication I detailed above of what exactly does count as a crash in JS, especially on the browser side, but at least if there were three options, there'd be the possibility of reporting accurate data, and then it'd just be on us SDK devs to figure out what to send when.)
- Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level? Point 2: the % of the customer base would be all platforms where the boolean doesn't really apply. In theory they would all be impacted
There's no real workaround here. As for impact, I don't know the total numbers, but the mere fact that it affects JS SDK users already means we're talking about a plurality of our customers, both paid and unpaid. I wouldn't be surprised if adding in all of the other affected platforms pushed us past 50%.
FWIW, I wrote up the JS-specific parts of this issue here: https://github.com/getsentry/sentry-javascript/issues/6073
The JS SDK team chatted and agreed that as a first step, we will start sending improved (even if not perfect) data somewhere else (TBD where) in the event, so that a) the scale of the problem can be quantified, and b) any backend or UI folks who do eventually pick up that end of the work have example data to work with. (This will likely be somewhere viewable in the JSON but not directly in the UI for the moment, since for now it's really just analytics.)
I'll update here once exactly where and exactly what we'll send is hammered out. Once we do that, maybe other SDKs can do a similar thing.
I agree in general with your idea of having three separate categories to split errors recorded manually by users, errors recorded by Sentry auto-instrumentation, and errors which crash the program.
However, I think we should be taking a step back here to reconsider our terminology and how we indicate error severity to users. The current handled/unhandled terminology is somewhat misleading in my opinion, since at least in the Python SDK that I am most familiar with, it does not actually indicate whether the exception gets caught, but rather whether the exception was reported manually by the user or by the SDK itself via an instrumentation. Furthermore, the UI visually indicates "unhandled" exceptions as more severe than "handled" ones, and I question whether this is necessarily the case.
For this reason, I propose changing the handled/unhandled terminology to something more meaningful, and I also propose creating a separate measure of error severity.
Capturing mechanism
For the capturing mechanism, we should have two categories to separate manually and automatically captured errors. These would replace the current labels of "handled" and "unhandled," respectively.
- Manually captured: Errors which users record via `Sentry.capture*` methods.
- Automatically captured: Errors which the Sentry SDK recorded automatically.
Unlike the current “handled” and “unhandled” categories, automatically captured errors would not necessarily be marked as higher severity than manually captured ones.
Severity levels
To indicate error severity, we should have a separate concept of error severity. I propose three error severity levels as follows, but am open to other suggestions:
- Low: Errors which are less likely to be caused by a bug, and more likely to be caused by user error, for example a 404 error returned by a web server. This category would be the default for manually captured errors, but users would be able to set a high severity on a per-event basis, instead.
- High: Errors which likely indicate a bug in the code or some other more serious issue that developers need to fix, for example a 500 error returned by a web server. This category would likely be the default for most types of automatically captured errors, but users would be able to set a low severity on a per-event or per-error-type basis, instead.
- Critical: Errors which crashed the process. This category would be reserved for crashes; i.e. users would be unable to manually mark an error as critical, and errors which crash the process would always be marked as critical regardless of configurations.
This separate severity level mechanism would be more flexible, since users would have the option to set a different severity level for certain errors or configure a different default level for certain classes of errors, instead of tying the severity level to how the error was captured.
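To make the proposal above concrete, here is a rough sketch of what such metadata could look like on an event; every name below is invented for illustration and none of it is an agreed-upon schema:

```ts
// All field and value names below are hypothetical, purely to illustrate the
// proposal above; none of this is part of the current Sentry event schema.
type CaptureSource = "manual" | "auto";      // would replace handled/unhandled
type Severity = "low" | "high" | "critical"; // separate severity dimension

interface ProposedMechanism {
  type: string;          // e.g. "onerror", "generic" (this field exists today)
  source: CaptureSource; // how the event was captured
  severity: Severity;    // user-overridable, except "critical"
}

// An auto-captured error that did NOT crash the process.
const autoCaptured: ProposedMechanism = {
  type: "onerror",
  source: "auto",
  severity: "high", // default for auto-captured errors; the user could lower it
};

// A true crash: "critical" would be reserved for the SDK and never settable
// directly by users.
const crash: ProposedMechanism = {
  type: "onuncaughtexception",
  source: "auto",
  severity: "critical",
};
```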
I generally like the idea of moving away from handled/unhandled. Two things:
For the capturing mechanism, we should have two categories to separate manually and automatically captured errors. These would replace the current labels of "handled" and "unhandled," respectively.
For the mechanism, we already have the `type` field. Maybe the way to go is to standardize what values we set for the respective cases and leave `handled` unset by default.
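As an illustration of that direction, a possible standardized vocabulary could look roughly like the following; the value names are examples drawn from what some SDKs already emit rather than a settled list:

```ts
// Example only: one possible standardized vocabulary for mechanism.type,
// with `handled` left unset unless the user explicitly caught the error.
const MECHANISM_TYPES = [
  "generic",              // manual Sentry.captureException by the user
  "onerror",              // browser global error handler
  "onunhandledrejection", // unhandled promise rejection
  "onuncaughtexception",  // node global exception handler
  "instrument",           // caught by SDK auto-instrumentation wrappers
] as const;

type MechanismType = (typeof MECHANISM_TYPES)[number];
```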
we should have a separate concept of error severity
I guess severity levels are hard to define well across different platforms. For example, in Browser JS we can’t detect a full crash and it basically never happens. Also, besides HTTP error responses (which not all of our SDKs capture), it’s probably hard to distinguish between high/low reliably.
Whatever we do here, if we want to solve this for the entire product, we need to coordinate this change with how we set the session status and, based on that, how we calculate release health. Maybe the key here is adapting the calculation to a certain platform, so that it makes (somewhat) sense for web, mobile and backend/server projects respectively.
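For context on the release-health side, sessions today end up in one of a small set of statuses (roughly exited, errored, abnormal, crashed), and `mechanism.handled` feeds into that. A sketch of how a three-way exception classification might map onto those statuses follows; the mapping itself is only a suggestion:

```ts
// Suggested mapping only, not a decided design. The session statuses shown
// are the ones release health uses today (exited / errored / abnormal / crashed).
type SessionStatus = "exited" | "errored" | "abnormal" | "crashed";

// Hypothetical three-way classification of a captured exception.
type ExceptionKind = "handled" | "unhandled" | "crash";

function sessionStatusFor(kind: ExceptionKind): SessionStatus {
  switch (kind) {
    case "crash":
      return "crashed"; // only true crashes would hurt the crash-free rate
    case "unhandled":
    case "handled":
      return "errored"; // still visible as errors, but not counted as crashes
  }
}
```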
https://sentry.zendesk.com/agent/tickets/146342