firefox-tv icon indicating copy to clipboard operation
firefox-tv copied to clipboard

Measure crash rate over time with Glean

Open athomasmoz opened this issue 5 years ago • 22 comments

Why/User Benefit/User Problem

As a Product Owner, I want to track our crash rate over time, so that I can ensure new releases aren't increasing it.

What / Requirements

  • A dashboard is available that charts the crash totals and rates over time

Acceptance Criteria (how do I know when I’m done?)

  • Number of crashes is captured
  • Number can be turned into a rate of crashes to DAU
  • Charts are available to view those numbers over time

Approach

  • Implementation from focus can be re-used here
  • A-C docs on crash component
  • Using AS' Build -> Analyze APK method, verify that release builds DO NOT add new permissions (in particular, FOREGROUND_SERVICE). If they do, raise with Ashley and we'll discuss moving forward

athomasmoz avatar Mar 08 '19 16:03 athomasmoz

This story is blocked on confirming that we already have a way to track crashes investigation

athomasmoz avatar Mar 08 '19 16:03 athomasmoz

@cpeterso Does GV use Socorro built-in for crash reporting? Or would we need to implement our own? @Dexterp37 Could we integrate glean with GV to get crashes? @pocmo On Fenix, woudl you include both glean & sentry for crash reporting? Or just one?

athomasmoz avatar Mar 27 '19 21:03 athomasmoz

@cpeterso Does GV use Socorro built-in for crash reporting? Or would we need to implement our own?

GV doesn't have a built-in crash reporter. When GV crashes, it passes the crash report to the app, who is responsible for showing any crash reporter permission UI and actually uploading the crash report to a server.

The good news is that AC already has a crash reporter component (lib-crash) that handles the localized permission UI and uploading to Socorro and/or Sentry. Integrating lib-crash is FFTV issue #1877.

Fenix and RB already use the crash reporter component, so FFTV can copy their code:

mozilla-mobile/fenix#267 mozilla-mobile/reference-browser#25

@pocmo On Fenix, woudl you include both glean & sentry for crash reporting? Or just one?

I think you will need both. Sentry includes stack traces for debugging, but only knows about crashes that users permitted to be uploaded to Mozilla. So Sentry doesn't know the actual number of crashes users are experiencing. Glean telemetry can be used to measure the actual number, even those that the user did not permit to be uploaded to Sentry. Firefox desktop has similar "crash ping" telemetry.

cpeterso avatar Mar 28 '19 07:03 cpeterso

@Dexterp37 Could we integrate glean with GV to get crashes?

I think there are ongoing discussion about planning the integration of lib-crash with glean. @travis79 and @georgf are following that effort.

Dexterp37 avatar Mar 28 '19 09:03 Dexterp37

@pocmo On Fenix, woudl you include both glean & sentry for crash reporting? Or just one?

Fenix uploads crash reports to Sentry (creating actionable crash reports for app developers) and Socorro (creating actionable crash reports for GeckoView developers) - handled by lib-crash. Glean is used for telemetry purposes only at the moment; but as @Dexterp37 mentioned there are plans to get better crash statistics using glean eventually.

So Sentry doesn't know the actual number of crashes users are experiencing.

This is correct. And in addition to that Sentry doesn't know about users that do not crash and therefore can't calculate crash rates etc. - This is one of the things we hope to improve with glean eventually.

pocmo avatar Mar 28 '19 10:03 pocmo

@cpeterso @Dexterp37 I see that the lib-crash docs say it supports native code crashes - would this also catch WebView crashes? We're still trying to figure out a way to count crashes for the System WebView, but if it does, maybe we can add lib-crash early on and then compare stats pre/post-GeckoView.

liuche avatar Apr 10 '19 21:04 liuche

I see that the lib-crash docs say it supports native code crashes - would this also catch WebView crashes? We're still trying to figure out a way to count crashes for the System WebView, but if it does, maybe we can add lib-crash early on and then compare stats pre/post-GeckoView.

I don't know, but I suspect lib-crash would not be able to catch System WebView crashes. Does WebView run its own content process that might crash without crashing the FFTV app? WebView might also have its own crash handling that interferes with Breakpad (Gecko's native crash reporter also uses in lib-crash).

If you just need to count crashes, perhaps FFTV could check on startup whether the last session exited cleanly? Set some flag on startup that is cleared on a clean exit? This approach might not be compatible with Android's background process killing or if users shut off their Fire TV devices while FFTV is running.

Also relevant: lib-crash has some problems catching native crashes in some threads on Firefox Reality: https://bugzilla.mozilla.org/show_bug.cgi?id=1493232

cpeterso avatar Apr 11 '19 00:04 cpeterso

@cpeterso @Dexterp37 I see that the lib-crash docs say it supports native code crashes - would this also catch WebView crashes? We're still trying to figure out a way to count crashes for the System WebView, but if it does, maybe we can add lib-crash early on and then compare stats pre/post-GeckoView.

Sorry, I don't really know about this :( @pocmo might be able to better answer it!

Dexterp37 avatar Apr 11 '19 08:04 Dexterp37

@cpeterso @Dexterp37 I see that the lib-crash docs say it supports native code crashes - would this also catch WebView crashes?

Unfortunately it won't. It relies on cooperation from GeckoView to catch those crashes.

pocmo avatar Apr 12 '19 14:04 pocmo

Clearing milestone & priority labels to send this back to triage: Glean is ready to support this functionality. As per Slack:

it looks like Glean crash rate collection via lib-crash is ready for consumption. The Reference Browser test is completed and we have a mini-dashboard to look at as an example (https://sql.telemetry.mozilla.org/queries/64067/source). Another heads-up, there was a probe_scraper change needed specifically for Reference Browser that needed to be made in order to notify it of the added dependency for lib-crash so that the pipeline would accept the data, so you may need to investigate whether this would also be needed for FFTV.

mcomella avatar Aug 07 '19 17:08 mcomella

Question: Is there a dashboard that displays this data, as part of the Glean crashping?

Alternative: can we download data from Sentry to get crashes per day? Sentry dashboard seems to only show "crashes per hour".

Sentry doesn't correlate crashes w/ sessions (e.g. percentage of sessions w/o a crash, or whether a single crash is crashing a little over time).

liuche avatar Aug 08 '19 21:08 liuche

Question: Is there a dashboard that displays this data, as part of the Glean crashping?

I'm afraid there isn't one available, right now.

There is this query for showing the crash stats for the &-browser, which could be forked for FFTV. @fbertsch , did you end up filing a bug for creating this dashboard? If you didn't, what's the component @liuche could file a bug in?

Important note: the Glean SDK is not sending a crash ping. Is lib-crash that is sending 2 crash metrics as part of the metrics ping in the Glean SDK.

Dexterp37 avatar Aug 09 '19 07:08 Dexterp37

Glean has a crash reporter that can easily be added to CrashReporter services.

https://github.com/mozilla-mobile/android-components/tree/master/components/lib/crash#sending-crash-reports-to-glean

Some suggested Acceptance Criteria

  • Glean Crash reporter is added, and crash counts are being sent
  • Documentation added to FFTV telemetry docs, explaining what the crash reporter collects
  • Data review for adding new telemetry collection of Glean crash reporter

liuche avatar Sep 14 '19 00:09 liuche

Originally I thought that this would just be a quick change to start using the Glean SDK crash count ping in the metrics ping, and update the documentation, but it looks like we'd need to make a permissions bump in order to that, after talking w/ @mcomella .

I don't think that adding this ping is worth forcing a permissions bump - this still wouldn't give us a non-native exceptions count (bc this wouldn't collect native WebView crashes), so this data would match what we have in Sentry (and just be a comparison point).

@athomasmoz does that align with what you're thinking? This would not give us much new data, but risks getting users stuck on an old version due to the permissions bump.

liuche avatar Oct 01 '19 00:10 liuche

Originally I thought that this would just be a quick change to start using the Glean SDK crash count ping in the metrics ping, and update the documentation, but it looks like we'd need to make a permissions bump in order to that, after talking w/ @mcomella .

@mcomella @liuche what's the new required permission?

I don't think that adding this ping is worth forcing a permissions bump - this still wouldn't give us a non-native exceptions count (bc this wouldn't collect native WebView crashes), so this data would match what we have in Sentry (and just be a comparison point).

Please note that this would also account for "sideloads" of FFTV, if that's a thing on this platform. Sentry does not contain data about sideloads AFAIK.

@athomasmoz does that align with what you're thinking? This would not give us much new data, but risks getting users stuck on an old version due to the permissions bump.

Dexterp37 avatar Oct 01 '19 06:10 Dexterp37

@Dexterp37 when new permissions are added, users won't be automatically updated to the newest version of the app. This means that we may have users stuck on old versions, not getting security updates or new features.

@liuche agreed with your recommendation that it's not worth the risk

athomasmoz avatar Oct 02 '19 16:10 athomasmoz

@Dexterp37 when new permissions are added, users won't be automatically updated to the newest version of the app. This means that we may have users stuck on old versions, not getting security updates or new features.

Thanks for the clarification. I was interested in knowing what permission does this need and why does it need it: to understand if is something that could be relaxed or fixed on the lib-crash side of things :)

Dexterp37 avatar Oct 02 '19 17:10 Dexterp37

It looks like FFTV already uses 2 of the 3 permissions that lib-crash uses, it would appear that the adding of the FOREGROUND_SERVICE permission is the one that they are concerned about.

What I would like to know is that since this permission isn't added until API 28, whether or not it affects this since I think the only API's used here are 22 (FireOS 5) and 25 (FireOS 6). Also, since this permission is listed as a permission level of "normal" rather than "dangerous" it shouldn't require the prompt to the user. This may be due to my lack of understanding of how this is handled in the Amazon ecosystem.

travis79 avatar Oct 02 '19 17:10 travis79

@travis79 We're still unsure if adding permissions that aren't tracked by the system forces users to explicitly update the app on the versions we're concerned about.

However, I think we can workaround this concern entirely by using the <uses-permission-sdk-23 tag – which only declares the permission for SDK 23+ – instead of the <uses-permission-sdk tag.

mcomella avatar Oct 02 '19 18:10 mcomella

Thanks @mcomella! I honestly didn't know how it should work so I appreciate the extra information. This is something that should be added to the documentation of lib-crash, for cases such as this.

travis79 avatar Oct 02 '19 18:10 travis79

Hi @liuche what's the latest on this one? Can I move it off the board and into the backlog until we're ready to pick it up?

athomasmoz avatar Nov 08 '19 19:11 athomasmoz

Yes! let's do that.

liuche avatar Nov 08 '19 22:11 liuche