Process around Error logs (Production issues: root cause analysis + reduction of noise)

Open techsmyth opened this issue 1 year ago • 0 comments

Description

At the moment, errors are being tracked by the logging tools we use: Elastic and Sentry. We need a clear process for addressing these issues proactively, even when users have not reported them.

Initiative / goal

By having a process for issue tracking / investigation / prioritization, we will have an efficient way of addressing production issues that are experienced by users.

Hypothesis

  • Measures of success will be the following attributes:
      • Fast investigation and less effort when a new issue appears
      • Issues experienced by users are addressed proactively
      • For Elastic logs, alerting is applied on new types of errors
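
One possible way to decide what counts as a "new type of error" is to reduce each log message to a fingerprint by masking its variable parts, and to alert only the first time a fingerprint is seen. The sketch below is an illustration under assumptions, not the Elastic or Sentry mechanism: `fingerprint`, `NewErrorDetector` and the masking patterns are hypothetical names and rules chosen for the example.

```python
import hashlib
import re

# Hypothetical patterns for the variable parts of a log message.
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMBER = re.compile(r"\d+")


def fingerprint(message):
    """Reduce a message to a stable key by masking UUIDs and numbers."""
    normalized = _UUID.sub("<uuid>", message)   # mask UUIDs first, they contain digits
    normalized = _NUMBER.sub("<n>", normalized)  # then mask remaining numbers
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


class NewErrorDetector:
    """Remembers fingerprints already seen and reports only first occurrences."""

    def __init__(self):
        self._seen = set()

    def is_new(self, message):
        key = fingerprint(message)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


detector = NewErrorDetector()
for line in ["User 42 not found", "User 97 not found", "Space 7 not found"]:
    if detector.is_new(line):
        print("ALERT: new error type:", line)
```

With this masking, the first and second messages share a fingerprint, so only the first and third would raise an alert. In practice the set of seen fingerprints would need to be persisted rather than held in memory.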

Acceptance criteria and must have scope

  • [ ] Access: the whole dev team has access to and uses the logging tools
  • [ ] Add a unique identifier for each mutation, which is used as follows:
      • Client exceptions have the unique identifier embedded in the text message
      • Elastic error submissions have the error code included
  • [ ] Identify what additional information needs to be logged (on-going) + raise issues / do it
  • [ ] Define the process for checking the logs
      • After root cause / reproduction steps are identified, a bug / story is created
      • Documented approach for different types of errors
  • [ ] Only people in the EU / EU nationals can access the raw data
  • [ ] Time for 3 sprints to run the process + then evaluate
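
The unique-identifier criterion above could look roughly like the following: one identifier is generated per mutation, embedded in the message the client sees, and attached to the record sent to the log pipeline so the two can be matched during investigation. This is a minimal sketch under assumptions: the function names, the `ref:` message format, the JSON field names and the `E_SAVE` code are all hypothetical and not taken from the existing codebase.

```python
import json
import uuid


def new_mutation_id():
    """Generate one identifier per mutation call (a random UUID, by assumption)."""
    return str(uuid.uuid4())


def client_message(text, mutation_id):
    """Text shown to the user, with the identifier embedded for support requests."""
    return f"{text} (ref: {mutation_id})"


def log_record(error, mutation_id, error_code):
    """JSON document for the log pipeline; the field names are hypothetical."""
    return json.dumps(
        {
            "level": "error",
            "mutationId": mutation_id,
            "errorCode": error_code,
            "message": str(error),
        }
    )


mutation_id = new_mutation_id()
try:
    raise ValueError("database write failed")  # stand-in for a failing mutation
except ValueError as exc:
    print(client_message("Unable to save your changes", mutation_id))
    print(log_record(exc, mutation_id, "E_SAVE"))
```

A user who reports the message can then quote the reference, and searching the logs for the same identifier leads to the matching server-side record.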

Stakeholders

Development and Product team

Timeline

Access to Elastic / Sentry / Prod env / Prod db

Timeline: Q1 2024

Former description of this epic

Description

Ensuring that all errors on production, whether client or server, go through a root cause analysis. Firefighting...to get the error counts way down so that there is better information from the errors that do happen.

Initiative / goal

Quality, Usability

Hypothesis

All errors on production are potentially costing us users

Acceptance criteria and must have scope

  • [ ] Server error logs are analyzed and issues seen there are tracked down (data / bugs on client etc)
  • [ ] Client errors are tracked down on Sentry
  • [ ] Issues are raised where needed to fix the underlying cause
  • [ ] Triggers are in place to inform development / support
  • [ ] Initial addressing of issues identified, together with improvements to make tracking some errors more tractable

Note: longer term effort also required to structurally improve exception handling in the cluster / server

Stakeholders

Product

Timeline

asap

Additional context

There are a significant number of on-going errors being logged in the server, some of which are serious but we are simply not paying enough attention to them. Ditto on the client.

techsmyth, Sep 24 '23 11:09