sentry-cocoa icon indicating copy to clipboard operation
sentry-cocoa copied to clipboard

Skip initialization if crash was caused by the SDK itself

Open bruno-garcia opened this issue 4 years ago • 15 comments

When SentrySDK.start is called and a crash report is detected, skip initialization altogether in case that crash was caused by code within the SDK itself. Until new SDK version / app version it should stay disabled.

bruno-garcia avatar Oct 21 '21 17:10 bruno-garcia

We could identify if a crash is coming from the SDK itself by looking at the stracktrace and sending a special type event to Sentry, so we know that we are crashing. Our users could get a special warning in Sentry so that they are aware something is wrong. Then they could ship a patch do downgrade to the latest stable version.

philipphofmann avatar Oct 21 '21 17:10 philipphofmann

SentryCrash has a similar concept of a recrash report, see https://github.com/getsentry/sentry-cocoa/blob/ad3b44b89858b251d3161f4a689842f1ee1772d6/Sources/SentryCrash/Recording/SentryCrashC.c#L101-L102

philipphofmann avatar Oct 21 '21 17:10 philipphofmann

let me play a bit the defensive person here, if the crash is actually on sending events, raising a new event actually causes a sort of infinite loop, I'm not a big fan of raising SDK's issue because of that. Also, if we introduce a bug on checking if an event is from the SDK or not, might crash everyone. I'm trying to just give food for thought.

marandaneto avatar Oct 22 '21 09:10 marandaneto

if the crash is actually on sending events, raising a new event actually causes a sort of infinite loop

Right, the idea is to just turn off the SDK if we detect the crash we're processing was initiated from the SDK itself. Not to try to capture anything.

bruno-garcia avatar Oct 26 '21 14:10 bruno-garcia

got it, what's about if we indeed shipped a bug'ed SDK, then fix/release it, the SDK is still not going to be able to init by itself since there's a crash there and it bails out, wondering how this would work.

marandaneto avatar Oct 27 '21 08:10 marandaneto

got it, what's about if we indeed shipped a bug'ed SDK, then fix/release it, the SDK is still not going to be able to init by itself since there's a crash there and it bails out, wondering how this would work.

The SDK will check if version changed or app was reinstalled

bruno-garcia avatar Oct 27 '21 14:10 bruno-garcia

One idea that came up during a discussion, to avoid getting completely blind when we turn this off: Have a simple GET request to relay with /projectid/sdk.name/version so we can get metrics on the backend about this feature kicking in. This obviously is a rough idea and requires more discussions with other teams.

bruno-garcia avatar Oct 27 '21 14:10 bruno-garcia

According to @armcknight, this is what they did for Specto:

... we didn’t do anything like inspect a stack trace to make the determination. We simply write marker files at various stages of initialization, and then before starting subsequent initializations, check if we wrote a “init succeeded” marker from the last init. If not, we bail until we see a new sdk/app/OS version.

philipphofmann avatar Dec 30 '21 10:12 philipphofmann

just remembered we wrote a blog post describing the strategy: https://medium.com/specto/preventing-repeated-crashes-on-launch-in-our-sdk-20cb4cc3e430

armcknight avatar Jan 26 '22 19:01 armcknight

This would break the SDK crash detection: https://github.com/getsentry/sentry/issues/44342. We want that the SDK sends the crash event. If our SDK keeps crashing, we have the report start-up crashes feature to ensure the event ends up in Sentry https://github.com/getsentry/sentry-cocoa/pull/2220. If the SDK can't even report the start-up crash anymore, I think we shouldn't skip the initialization but instead keep crashing so our customers are aware. I don't think it's worth the effort to plan for this edge case, and as pointed out by @marandaneto, we could keep crashing anyways. Instead, we should ensure with a proper test suite that this doesn't happen. I don't think this is required anymore. Please reopen with a comment if you disagree.

philipphofmann avatar Jul 13 '23 13:07 philipphofmann

In the case you describe where we cause an app launch crash and say we should let it happen so customers are aware, I disagree. That is the worst possible UX issue we could cause for end users. They still need the app to work, regardless of whether Sentry is working correctly. What you're proposing is that Sentry is more important than the app, I disagree.

We could implement something like a heartbeat by which we present a notification in the frontend like "hey, we stopped receiving events from app version X at datetime Y, something may be wrong with your Sentry installation in production, please investigate and push out an update"

armcknight avatar Nov 27 '23 19:11 armcknight

Agree with Andrew here, we absolutely shouldn't crash the app. If we do, it shouldn't be more than once (if we can do that). Make sure that for that version of the app/Sentry we skip init again.

bruno-garcia avatar Nov 27 '23 23:11 bruno-garcia

In the case you describe where we cause an app launch crash and say we should let it happen so customers are aware, I disagree.

I might have been wrong here. I don't think Sentry is more important than the app. One of the worst things that can happen is that customers think everything is fine although the house is on fire. The worst thing for us is that we repeatedly crash an app. We could come up with a strategy similar to what you described in your blog post to avoid repeated crashes, but we must ensure that the strategy notifies us and tells us that something is broken. We could send a special type of event to Sentry only once after we detected that our SDK repeatedly crashed to avoid using up customer quota. With the SDK crash detection, it would be easy for us to set an alarm.

What made you revisit the decision here, @armcknight? Would you like to reopen the issue?

philipphofmann avatar Nov 28 '23 13:11 philipphofmann

We could try to send a special type of event to notify ourselves of the situation, but barring that, we could fall back on the heartbeat type of strategy as well.

I think I saw this in an old github notification yesterday I'd never followed up on 😄 We could reopen it and backlog it, like I said I'm not sure about prioritization.

armcknight avatar Nov 28 '23 20:11 armcknight

We'll discuss it in our next sync.

philipphofmann avatar Nov 29 '23 15:11 philipphofmann