Better error telemetry
Recently, we've been discussing how to improve the way we use Glean to capture error telemetry. After taking a quick survey of our current error telemetry, I think our current code is actually close to what we want; we just need a few tweaks.
I think we should take the logins Android error handling as a starting point:
- In `metrics.yaml` we define several metrics:
  - Total write query count
  - Total read query count
  - Read query error counts (this is a labeled counter, so we can create a count for each error type)
  - Write query error counts (also a labeled counter)
- In `DatabaseLoginsStorage.kt` we increment those counters
- Finally we graph the errors on our logins dashboard
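To make the pattern above concrete, here is a minimal sketch in Rust. It is not the actual `DatabaseLoginsStorage.kt` code and it does not use the Glean API: `LabeledCounter`, `LoginsMetrics`, and the error labels are hypothetical stand-ins, and a plain map plays the role of a labeled counter.

```rust
use std::collections::HashMap;

/// Stand-in for a labeled counter: one count per label.
/// (Assumption: this only mimics the idea, it is not the Glean type.)
#[derive(Default)]
struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// Hypothetical mirror of the four logins metrics listed above.
#[derive(Default)]
struct LoginsMetrics {
    read_query_count: u64,
    write_query_count: u64,
    read_query_error_count: LabeledCounter,
    write_query_error_count: LabeledCounter,
}

impl LoginsMetrics {
    /// Count a read query; on failure, also count the error under its label.
    fn read_query<T>(
        &mut self,
        op: impl FnOnce() -> Result<T, &'static str>,
    ) -> Result<T, &'static str> {
        self.read_query_count += 1;
        let result = op();
        if let Err(label) = &result {
            self.read_query_error_count.add(*label);
        }
        result
    }

    /// Same as `read_query`, but for the write counters.
    fn write_query<T>(
        &mut self,
        op: impl FnOnce() -> Result<T, &'static str>,
    ) -> Result<T, &'static str> {
        self.write_query_count += 1;
        let result = op();
        if let Err(label) = &result {
            self.write_query_error_count.add(*label);
        }
        result
    }
}
```

Each storage call would be wrapped in `read_query` or `write_query`, so the totals and the per-label error counts stay in step without the caller touching the counters directly.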
I think we can get pretty good error telemetry with a few tweaks:
- Better metrics.
  - I don't think the read/write distinction is that useful. What if we replace that with just a total query count?
  - We use labeled counters to track error types, but we don't use that in the graph. What if we:
    - Combine `read_query_error_count` and `write_query_error_count` into a single `errors_by_type` labeled counter.
    - Visualize that on the dashboard as errors per day, grouped by type, like we do with sync errors.
    - Improve the error type detection. Right now, almost all errors are grouped under the `__other__` type. Getting access to Glean from Rust would be great, since then we could have this code in Rust.
  - Add an `errors_by_function` labeled counter and visualize that in a similar way. This could help us track down which function was generating errors. It seems like an improvement on the read/write distinction to me.
- Use this system for other components as well.
  - I think this means a bunch of similar `metrics.yaml` files per-component.
  - Maybe the upcoming `struct` metric could reduce the duplication?
  - Create a global `errors_by_component` labeled counter. This would provide a nice overview for our main dashboard.
- Create a shared metric system. We should track the same metrics on iOS (and desktop once we're there). Getting access to Glean from Rust would be a huge help here too.
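As a rough illustration of the proposed counters, the sketch below records one query outcome against `errors_by_type`, `errors_by_function`, and `errors_by_component`. Everything except those three metric names and the `__other__` label is an assumption made for the example: `StorageError`, `error_label`, `ErrorTelemetry`, and the label strings are hypothetical, and plain maps stand in for Glean labeled counters. For brevity it keeps all counters in one struct, whereas the proposal has the first three defined per component and only `errors_by_component` global.

```rust
use std::collections::HashMap;

/// Hypothetical component error type, used only for this example.
#[derive(Debug)]
enum StorageError {
    DatabaseLocked,
    InvalidKey,
    Unexpected(String),
}

/// Map an error to its label. Anything not recognized falls back to
/// `__other__`, which is the bucket we want to shrink.
fn error_label(err: &StorageError) -> &'static str {
    match err {
        StorageError::DatabaseLocked => "database_locked",
        StorageError::InvalidKey => "invalid_key",
        StorageError::Unexpected(_) => "__other__",
    }
}

/// Increment the count stored under `label`.
fn bump(counter: &mut HashMap<String, u64>, label: &str) {
    *counter.entry(label.to_string()).or_insert(0) += 1;
}

/// Stand-in for the proposed metrics (not the Glean API).
#[derive(Default)]
struct ErrorTelemetry {
    query_count: u64,
    errors_by_type: HashMap<String, u64>,
    errors_by_function: HashMap<String, u64>,
    errors_by_component: HashMap<String, u64>,
}

impl ErrorTelemetry {
    /// Record the outcome of one query made by `function` in `component`.
    fn record<T>(
        &mut self,
        component: &str,
        function: &str,
        result: &Result<T, StorageError>,
    ) {
        self.query_count += 1;
        if let Err(err) = result {
            bump(&mut self.errors_by_type, error_label(err));
            bump(&mut self.errors_by_function, function);
            bump(&mut self.errors_by_component, component);
        }
    }
}
```

Keeping the error-to-label mapping in one function like `error_label` is what would let the same classification be shared across platforms once Glean can be called from Rust.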
I think the current code is close, but I don't think we should rush to implement this, for a couple of reasons:
- We should finish improving our Sentry errors first.
- If we wait until Glean is available in Rust, things get much easier.
Moved to bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1866357