alkemio
Client Errors - Improve observability and quality
Description
We need better observability of the client so that we can track and fix user issues, set quality KPIs, and categorize issues more effectively.
Goal
This initiative aims to improve our users' experience. Better detection and tracking of critical issues should lead to quicker focus and resolution. Setting quality KPIs will give stakeholders visibility into the quality of the client. Critical client errors should be resolved with priority.
Hypothesis
By utilizing and improving the Sentry integration (3rd party tool) and APM, we can better track and organize client errors. Once we better organize the errors and fix the critical ones, we can define specific KPI targets and set a monitoring schedule.
Must have scope
Structure the process around client observability:
[ ] Revise and improve Sentry logging.
[ ] Revise APM and how we can use it in isolation or in combination with Sentry.
[ ] Set reasonable crash-free, performance, and unhandled-error KPIs.
[ ] Set alerting, monitoring, and ownership (on critical errors, new errors, post-release, etc.).
Analysis of the issues being experienced, with recommendations of issues to be addressed:
[ ] Tag/categorize client errors (by severity and domain).
[ ] Log the current critical errors.
[ ] Log and fix or categorize the most common errors (to reduce the noise).
Optional:
[ ] Research which other Sentry features (Metrics, Replays, etc.) could be used to better track user experience.
Next:
[ ] Plan a follow-up epic for the heavier issues.
Here's a link to the initial challenge.
Stakeholders
@techsmyth @valentinyanakiev @me-andre @hero101 @Comoque1 @bobbykolev
Thanks for opening your first issue here! Be sure to follow the issue template!
We already have an implementation for Level (fatal, error, warning, etc.) in Sentry/log.ts. We could extend the same implementation to enrich the context with tags: https://docs.sentry.io/platforms/javascript/guides/react/enriching-events/tags/ Tags could help us identify the affected functionality (Auth, Server, Callout, etc.) or carry other information that helps us track down an error faster.
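A rough sketch of what a tag-aware logger could look like. This is not the current Sentry/log.ts; the `Domain` values, `LogContext`, `TaggedEvent`, and `createLogger` are hypothetical names used only for illustration. The transport is injected so the snippet runs without the SDK; the comment shows how it might be wired to Sentry's `withScope`, `setLevel`, `setTag`, and `captureException`.

```typescript
// Hypothetical severity and domain vocabularies; the real Sentry/log.ts may differ.
type Level = "fatal" | "error" | "warning" | "info";
type Domain = "auth" | "server" | "callout" | "unknown";

interface LogContext {
  domain?: Domain;
  // Any extra tags, e.g. { feature: "login" }.
  tags?: Record<string, string>;
}

interface TaggedEvent {
  error: Error;
  level: Level;
  tags: Record<string, string>;
}

// The transport is injected so this sketch has no SDK dependency.
// In the client it could wrap the Sentry SDK, for example:
//   Sentry.withScope(scope => {
//     scope.setLevel(event.level);
//     Object.entries(event.tags).forEach(([k, v]) => scope.setTag(k, v));
//     Sentry.captureException(event.error);
//   });
type Transport = (event: TaggedEvent) => void;

function createLogger(send: Transport) {
  // Builds one logging function per level; every event always gets a "domain" tag.
  const log = (level: Level) => (error: Error, context: LogContext = {}): void => {
    send({
      error,
      level,
      tags: { ...context.tags, domain: context.domain ?? "unknown" },
    });
  };
  return { fatal: log("fatal"), error: log("error"), warning: log("warning"), info: log("info") };
}

// Usage: collect events in memory instead of sending them anywhere.
const sent: TaggedEvent[] = [];
const logger = createLogger(event => sent.push(event));
logger.error(new Error("Token refresh failed"), { domain: "auth", tags: { feature: "login" } });
```

Defaulting the domain to "unknown" keeps untagged errors filterable, which could make it easier to see how much of the noise is still uncategorized.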
The following epic duplicates this one: https://app.zenhub.com/workspaces/alkemio-development-5ecb98b262ebd9f4aec4194c/issues/gh/alkem-io/alkemio/1291
@bobbykolev good catch, can you please merge the other epic into this one? That means both the description/information and any issues that are still relevant.
Done.