Process around Error logs (Production issues: root cause analysis + reduction of noise)

Open techsmyth opened this issue 1 year ago • 0 comments

Description

At the moment, errors are tracked by the logging tools we use: Elastic and Sentry. We need a clear process for addressing these issues proactively, even if users have not reported them.

Initiative / goal

With a process for issue tracking / investigation / prioritization, we will have an efficient way of addressing the production issues that users experience.

Hypothesis

Success will be measured by the following attributes:
Fast investigation and less effort when a new issue appears
Issues experienced by users are addressed proactively
For Elastic logs, alerting is applied to new types of errors
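
One possible way to decide what counts as a "new type of error" is to reduce each logged error to a fingerprint and alert only the first time a fingerprint is seen. The sketch below is illustrative only: `LoggedError`, `fingerprint` and `NewErrorTypeDetector` are hypothetical names, not part of Elastic, Sentry or the existing codebase, and it assumes each log entry carries an error code and a text message.

```typescript
// Hypothetical sketch: these names are assumptions, not existing code.
interface LoggedError {
  code: string;    // assumed error code field on the log entry
  message: string; // free-text error message
}

// Collapse variable parts (UUIDs, numbers) so that errors differing only
// in identifiers map to the same fingerprint.
function fingerprint(err: LoggedError): string {
  const normalized = err.message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d+/g, "<n>")
    .trim()
    .toLowerCase();
  return `${err.code}:${normalized}`;
}

class NewErrorTypeDetector {
  private readonly seen: Set<string>;

  // `known` can be seeded with fingerprints that are already triaged.
  constructor(known: string[] = []) {
    this.seen = new Set<string>(known);
  }

  // Returns true only the first time a given error type is observed,
  // which is the moment an alert would be raised.
  observe(err: LoggedError): boolean {
    const fp = fingerprint(err);
    if (this.seen.has(fp)) {
      return false;
    }
    this.seen.add(fp);
    return true;
  }
}
```

With this approach, "User 42 not found" and "User 7 not found" under the same code count as one error type, so only the first occurrence would trigger an alert. In practice the set of known fingerprints would need to be persisted somewhere; that part is left out here.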

Acceptance criteria and must have scope

[ ] Access: the whole dev team has access to, and uses, the log tools
[ ] Add a unique identifier for each mutation, used as follows:
Client exceptions have the unique identifier embedded in the text message
Elastic error submissions have the error code included
[ ] Identify what additional information needs to be logged (on-going) + raise issues / do it
[ ] Define the process for checking the logs
After the root cause / reproduction steps are identified, a bug / story is created
Documented approach for different types of errors
[ ] Only people in the EU / EU nationals can access the raw data
[ ] Run the process for 3 sprints, then evaluate
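
The unique identifier criterion above could look roughly like the following. This is a minimal sketch under stated assumptions: `TrackedError`, `ErrorLogEntry` and `toLogEntry` are hypothetical names, the identifier is assumed to be a UUID, and how the entry is actually shipped to Elastic is not shown.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not existing code.
import { randomUUID } from "node:crypto";

// Assumed shape of the structured entry sent to the server-side log.
interface ErrorLogEntry {
  errorId: string;
  code: string;
  message: string;
  timestamp: string;
}

class TrackedError extends Error {
  readonly errorId: string;
  readonly code: string;

  constructor(message: string, code: string, errorId: string = randomUUID()) {
    // Embed the identifier in the text so that a client-side exception
    // report can be matched to the server-side log entry.
    super(`${message} [errorId: ${errorId}]`);
    this.name = "TrackedError";
    this.errorId = errorId;
    this.code = code;
  }
}

// Build the structured log entry, carrying both the identifier and the code.
function toLogEntry(err: TrackedError): ErrorLogEntry {
  return {
    errorId: err.errorId,
    code: err.code,
    message: err.message,
    timestamp: new Date().toISOString(),
  };
}
```

A failed mutation would then throw something like `new TrackedError("Mutation failed", "MUTATION_FAILED")`, and searching the logs for the `errorId` shown in the client message would lead straight to the matching server entry.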

Stakeholders

Development and Product team

Timeline

Access to Elastic / Sentry / Prod env / Prod db

Timeline: Q1 2024

Former description of this epic

Description

Ensure that all errors on production, whether client or server, go through a root cause analysis. Firefighting... to get the error counts way down, so that there is better information from the errors that do happen.

Initiative / goal

Quality, Usability

Hypothesis

All errors on production are potentially costing us users

Acceptance criteria and must have scope

[ ] Server error logs are analyzed and issues seen there are tracked down (data / bugs on client etc)
[ ] Client errors are tracked down on Sentry
[ ] Issues are raised where needed to fix the underlying cause
[ ] Triggers are in place to inform development / support
[ ] Initial addressing of issues identified, together with improvements to make tracking some errors more tractable

Note: longer term effort also required to structurally improve exception handling in the cluster / server

Stakeholders

Product

Timeline

asap

Additional context

There is a significant number of on-going errors being logged on the server, some of which are serious, but we are simply not paying enough attention to them. Ditto on the client.

Sep 24 '23 11:09 techsmyth
