fourkeys icon indicating copy to clipboard operation
fourkeys copied to clipboard

Question about the root cause of incidents

Open na-ga opened this issue 3 years ago • 0 comments

Question

I have a question about the root cause of incidents. fourkeys raises incidents as GitHub/GitLab issues. Is the following specification correct for the root cause that can be specified for incidents?

  1. specify the commit Id (example: root cause: 1bb56d92c5d95162f12a1ad08b8cfe736ee111e6)
  2. one root cause can be specified for one Incident (multiple root courses are not allowed)

Background

As a background to the question, in the schema of the incidents view, the changes field is of type array<string>, so multiple changes are considered to be associated with one incident.

The query that generates the incidents view uses REGEXP_EXTRACT to get the root cause, so the first string that matches the pattern is treated as the root cause.

Whenever we update the corresponding issue in any way, a Webhook request is sent and stored in the events_row table by the event handler. So, if the my question is correct, the changes field in the incidents view will be set as an array with the root cause for the number of times the corresponding issue has been updated.

As a result, the Metrics Change Failure Rate will be abnormally high. This is because the following query for calculating the metrics JOIN changes in the incidents view with changes in the deployments view, so the incident_id to be calculated is overloaded with duplicate changes in the incidents view.

SELECT
TIMESTAMP_TRUNC(d.time_created, DAY) as day,
SUM(IF(i.incident_id is NULL, 0, 1)) / COUNT(DISTINCT change_id) as change_fail_rate
FROM four_keys.deployments d, d.changes
LEFT JOIN four_keys.changes c ON changes = c.change_id
LEFT JOIN(SELECT
        incident_id,
        change,
        time_resolved
        FROM four_keys.incidents i,
        i.changes change) i ON i.change = changes
GROUP BY day;

If I understand correctly, the query that generates the incidents view works well by removing duplicate changes. For example, change it to the following.

# Incidents Table
SELECT
source,
incident_id,
MIN(IF(root.time_created < issue.time_created, root.time_created, issue.time_created)) as time_created,
MAX(time_resolved) as time_resolved,
ARRAY_AGG(DISTINCT root_cause IGNORE NULLS) changes, # Add distinct
... skip ...

On the other hand, there is still a problem with this. If you change the root cause of the issue in question, the previously specified root cause ties are not removed. If you have only a single root cause tied to an Incident, then dealing with the latest root cause will work fine. For example, change it to the following.

# Incidents Table
SELECT
source,
incident_id,
MIN(IF(root.time_created < issue.time_created, root.time_created, issue.time_created)) as time_created,
MAX(time_resolved) as time_resolved,
ARRAY_AGG(root_cause IGNORE NULLS ORDER BY issue.time_requested DESC LIMIT 1) changes, # Add order and limit
FROM
(
SELECT 
time_created AS time_requested, # Add webhook requested time
... skip ...

There are several proposed fixes, but I don't know the root cause specification for Incident. Please advise me.

na-ga avatar Sep 14 '21 13:09 na-ga