alkemio
Process around Error logs (Production issues: root cause analysis + reduction of noise)
Description
At the moment, errors are tracked by the logging tools we use: Elastic and Sentry. We need a clear process for addressing issues proactively, even when users haven't reported them.
Initiative / goal
By having a process for issue tracking / investigation / prioritization, we will have an efficient way of addressing the production issues that users experience.
Hypothesis
- Success will be measured by the following attributes:
  - Fast investigation and less effort when a new issue appears
  - Issues experienced by users are addressed proactively
  - For Elastic logs, alerting is applied to new types of errors
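The "alerting on new types of errors" point could, for example, be implemented as an Elasticsearch Watcher definition (registered via `PUT _watcher/watch/<id>`). A minimal sketch, assuming an index pattern of `logs-*`, an ECS-style `log.level` field, and a hypothetical webhook endpoint — the real index names, fields, and notification target would need to match our actual Elastic setup:

```json
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "term": { "log.level": "error" } },
                { "range": { "@timestamp": { "gte": "now-10m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "notify_team": {
      "webhook": {
        "scheme": "https",
        "host": "hooks.example.com",
        "port": 443,
        "method": "post",
        "path": "/alerts/elastic",
        "body": "New server errors logged in the last 10 minutes"
      }
    }
  }
}
```

Alerting only on genuinely *new* error types (rather than any error) would additionally require grouping by an error code field, which is one more argument for the unique identifier item below.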
Acceptance criteria and must have scope
- [ ] Access: the whole dev team has access to, and uses, the logging tools
- [ ] Add a unique identifier for each mutation, used as follows:
  - Client exceptions embed the unique identifier in the message text
  - Elastic error submissions include the error code
- [ ] Identify what additional information needs to be logged (ongoing) + raise issues / address them
- [ ] Define the process for checking the logs:
  - After the root cause / reproduction steps are identified, a bug / story is created
  - Document the approach for different types of errors
- [ ] Only people located in the EU / EU nationals can access the raw data
- [ ] Allow time for three sprints to run the process, then evaluate
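The "unique identifier for each mutation" item could look roughly like the sketch below. The names (`Logger`, `withErrorCode`) are hypothetical, not Alkemio's actual API; the point is that the same generated identifier appears in both the client-facing message and the Elastic log entry, so a user report can be correlated with the server-side log:

```typescript
import { randomUUID } from "crypto";

// Hypothetical logger interface; the real server wiring will differ.
interface Logger {
  error(message: string, meta: Record<string, unknown>): void;
}

// Wrap a mutation handler so every failure gets a unique error code that is
// both returned to the client and attached to the logged error.
function withErrorCode<T>(
  mutationName: string,
  logger: Logger,
  handler: () => Promise<T>
): Promise<T> {
  return handler().catch((err: unknown) => {
    const errorId = randomUUID(); // unique identifier for this failure
    logger.error(`Mutation '${mutationName}' failed`, {
      errorId, // searchable in Elastic
      mutationName,
      cause: err instanceof Error ? err.message : String(err),
    });
    // The client-facing message embeds the same identifier, so a user
    // report can be matched to the server-side log entry.
    throw new Error(`Something went wrong (error code: ${errorId})`);
  });
}
```

Whether this lives in a GraphQL interceptor, a decorator, or per-resolver wrapping is an implementation choice to settle during the epic.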
Stakeholders
Development and Product team
Timeline
Access to Elastic / Sentry / production environment / production db
Timeline: Q1 2024
Former description of this epic
Description
Ensuring that all errors on production, whether client or server, go through a root cause analysis. Firefighting: getting the error counts way down, so that there is better information from the errors that do happen.
Initiative / goal
Quality / Usability
Hypothesis
All errors on production are potentially costing us users.
Acceptance criteria and must have scope
- [ ] Server error logs are analyzed and the issues seen there are tracked down (data issues, bugs on the client, etc.)
- [ ] Client errors are tracked down on Sentry
- [ ] Issues are raised where needed to fix the underlying cause
- [ ] Triggers are in place to inform development / support
- [ ] The initial issues identified are addressed, together with improvements that make some errors more tractable to track
Note: longer term effort also required to structurally improve exception handling in the cluster / server
Stakeholders
Product
Timeline
asap
Additional context
There are a significant number of ongoing errors being logged in the server, some of which are serious, but we are simply not paying enough attention to them. Ditto on the client.