Better error telemetry

Open bendk opened this issue 3 years ago • 1 comment

Recently, we've been discussing how to improve our use of Glean to capture error telemetry. After taking a quick survey of our current error telemetry, I think our current code is actually close to what we want; we just need a few tweaks.

I think we should take the logins Android error handling as a starting point:

  • In metrics.yaml we define several metrics:
    • Total write query count
    • Total read query count
    • Read query error counts (this is a labeled counter, so we can create a count for each error type)
    • Write query error counts (also a labeled counter)
  • In DatabaseLoginsStorage.kt we increment those counters
  • Finally we graph the errors on our logins dashboard
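
The counting pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration in Rust, not the actual Kotlin code in DatabaseLoginsStorage.kt and not the Glean API. `LabeledCounter`, `LoginsMetrics`, `record_read`, `record_write`, and the `read_query_count`/`write_query_count` field names are hypothetical stand-ins, and representing an error type as a plain string label is an assumption made for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a labeled counter: one count per label.
/// This is NOT the Glean type, only an in-memory illustration.
#[derive(Default)]
pub struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    /// Increment the count stored under `label` by one.
    pub fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Current count for `label` (0 if the label was never recorded).
    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// Hypothetical mirror of the four metrics listed above:
/// two totals and two labeled error counters.
#[derive(Default)]
pub struct LoginsMetrics {
    pub read_query_count: u64,
    pub write_query_count: u64,
    pub read_query_error_count: LabeledCounter,
    pub write_query_error_count: LabeledCounter,
}

impl LoginsMetrics {
    /// Count a read query; on failure, also count its error label.
    pub fn record_read<T>(&mut self, result: &Result<T, String>) {
        self.read_query_count += 1;
        if let Err(label) = result {
            self.read_query_error_count.add(label);
        }
    }

    /// Count a write query; on failure, also count its error label.
    pub fn record_write<T>(&mut self, result: &Result<T, String>) {
        self.write_query_count += 1;
        if let Err(label) = result {
            self.write_query_error_count.add(label);
        }
    }
}
```

Dividing an error count by the matching total gives the kind of error rate that a dashboard could graph.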

I think we can get pretty good error telemetry with a few tweaks:

  • Better metrics.
    • I don't think the read/write distinction is that useful. What if we replace it with just a total query count?
    • We use labeled counters to track error types, but we don't use the labels in the graph. What if we:
      • Combined read_query_error_count and write_query_error_count into a single errors_by_type labeled counter.
      • Visualized that on the dashboard as errors per day, grouped by type, like we do with sync errors.
      • Improved the error type detection. Right now, almost all errors are grouped under the __other__ type. Getting access to Glean from Rust would be great, since then we could have this code in Rust.
    • Add an errors_by_function labeled counter and visualize that in a similar way. This could help us track down which function was generating errors. It seems like an improvement on the read/write distinction to me.
  • Use this system for other components as well.
    • I think this means a bunch of similar metrics.yaml files, one per component.
    • Maybe the upcoming struct metric could reduce the duplication?
    • Create a global errors_by_component labeled counter. This would provide a nice overview for our main dashboard.
  • Create a shared metric system. We should track the same metrics on iOS (and desktop once we're there). Getting access to Glean from Rust would be a huge help here too.
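
As a rough sketch of how the proposed counters might fit together if the recording code lived in Rust: the example below is illustrative only and does not use the Glean API. `LabeledCounter`, `StorageError`, its variants and label strings, `ErrorMetrics`, and `record` are all hypothetical names. For brevity it keeps the global errors_by_component counter in the same struct as the per-component counters, which a real design would probably separate.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a labeled counter (not the Glean type).
#[derive(Default)]
pub struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    /// Increment the count stored under `label` by one.
    pub fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Current count for `label` (0 if the label was never recorded).
    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// Hypothetical error enum; real component errors would differ.
pub enum StorageError {
    InvalidKey,
    DatabaseBusy,
    Other,
}

impl StorageError {
    /// Map each known variant to a stable label, so that only
    /// unclassified errors fall back to `__other__`.
    pub fn label(&self) -> &'static str {
        match self {
            StorageError::InvalidKey => "invalid_key",
            StorageError::DatabaseBusy => "database_busy",
            StorageError::Other => "__other__",
        }
    }
}

/// One total query count plus the three proposed labeled counters.
#[derive(Default)]
pub struct ErrorMetrics {
    pub query_count: u64,
    pub errors_by_type: LabeledCounter,
    pub errors_by_function: LabeledCounter,
    pub errors_by_component: LabeledCounter,
}

impl ErrorMetrics {
    /// Count every query; on failure, count the error under its type,
    /// the calling function, and the owning component.
    pub fn record<T>(
        &mut self,
        component: &str,
        function: &str,
        result: &Result<T, StorageError>,
    ) {
        self.query_count += 1;
        if let Err(err) = result {
            self.errors_by_type.add(err.label());
            self.errors_by_function.add(function);
            self.errors_by_component.add(component);
        }
    }
}
```

Deriving the label from the error enum in one place is what would let the error type detection be shared across Android, iOS, and desktop, rather than reimplemented in each wrapper.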

┆Issue is synchronized with this Jira Task

bendk avatar Jun 01 '22 20:06 bendk

I think the current code is close, but I don't think we should immediately rush to implement this, for a couple of reasons:

  • We should finish improving our Sentry errors first.
  • If we wait until Glean is available in Rust, things get much easier.

bendk avatar Jun 01 '22 20:06 bendk

Moved to bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1866357

Change performed by the Move to Bugzilla add-on.

mhammond avatar Nov 23 '23 22:11 mhammond