incubator-devlake icon indicating copy to clipboard operation
incubator-devlake copied to clipboard

[Bug][Grafana] DORADetails-ChangeFailureRate Change Failure Rate Panel SQL incorrect calculation

Open jeffschaper opened this issue 11 months ago • 4 comments

Search before asking

  • [x] I had searched in the issues and found no similar issues.

What happened

The "1. Change Failure Rate" panel on the "DORADetails-ChangeFailureRate" dashboard is displaying over 100%

Image

What do you expect to happen

I'd expect the percentage to never exceed 100%

How to reproduce

Select more than one project from the DORA dashboard, then view the "1. Change Failure Rate" panel on the "DORADetails-ChangeFailureRate" dashboard

Anything else

We're noticing the bug with 13 projects selected from the project list. It's not clear to me if this is an aggregation issue with 13 projects, when the dashboard is expecting one or if the SQL query should take this into account. It appears something similar happened before in https://github.com/apache/incubator-devlake/issues/4610

Version

v1.0.1

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

jeffschaper avatar Mar 26 '25 20:03 jeffschaper

The issue with the "Change Failure Rate" panel displaying a rate over 100% when multiple projects are selected is due to the SQL calculation not accounting for the distinct separation of deployments across different projects. The current query aggregates incidents and deployments without distinguishing between projects, leading to an inflated failure rate.

To correct this, the SQL query should be modified to calculate the change failure rate separately for each project and then aggregate the results. This can be achieved by including the project name in the GROUP BY clause and adjusting the calculation accordingly. Here is a revised version of the SQL query:

-- Metric 3: change failure rate
with _deployments as (
  SELECT
    cdc.cicd_deployment_id as deployment_id,
    max(cdc.finished_date) as deployment_finished_date,
    pm.project_name
  FROM
    cicd_deployment_commits cdc
    JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
    and pm.`table` = 'cicd_scopes'
  WHERE
    pm.project_name in (${project})
    and cdc.result = 'SUCCESS'
    and cdc.environment = 'PRODUCTION'
  GROUP BY
    1, 3
  HAVING
    $__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
  SELECT
    d.deployment_id,
    d.deployment_finished_date,
    d.project_name,
    count(distinct case when i.id is not null then i.id end) as has_incident
  FROM
    _deployments d
    left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
    left join incidents i on pim.id = i.id
  GROUP BY
    1, 2, 3
)
SELECT
  d.project_name,
  sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
  _failure_caused_by_deployments d
GROUP BY
  d.project_name

This modification ensures that the change failure rate is calculated per project, preventing the rate from exceeding 100% when multiple projects are selected [1][2].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other

dosubot[bot] avatar Mar 26 '25 20:03 dosubot[bot]

@dosu - Can you tell me if this is truly a bug with the SQL query or if the dashboard SQL query is working as designed?

jeffschaper avatar Mar 26 '25 20:03 jeffschaper

The issue with the "Change Failure Rate" panel displaying a rate over 100% when multiple projects are selected is indeed related to the SQL query. The current query aggregates incidents and deployments without distinguishing between projects, which can lead to an inflated failure rate when multiple projects are selected. This suggests that the query is not fully accounting for the distinct separation of deployments across different projects, which is why the rate can exceed 100%.

Therefore, it can be considered a bug in the SQL query design, as it does not handle the aggregation correctly when multiple projects are involved. The query should be adjusted to calculate the change failure rate separately for each project and then aggregate the results to ensure the rate does not exceed 100% when multiple projects are selected [1][2].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other

dosubot[bot] avatar Mar 26 '25 20:03 dosubot[bot]

@klesh - Curious if this issue has been discussed and/or in the queue to be worked on.

jeffschaper avatar Apr 30 '25 15:04 jeffschaper

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jun 30 '25 00:06 github-actions[bot]

This is still an issue.

jeffschaper avatar Jun 30 '25 14:06 jeffschaper

@Startrekzky Would you like to take a look when you find time?

klesh avatar Jul 07 '25 03:07 klesh

@jeffschaper I can't reproduce this bug on my local env.

Could you

  1. query the temp table _failure_caused_by_deployments and check if there are records with has_incident > 1?
  2. If so, please change the query to below and try it out.
-- Metric 3: change failure rate
with _deployments as (
  SELECT
    cdc.cicd_deployment_id as deployment_id,
    max(cdc.finished_date) as deployment_finished_date,
    pm.project_name
  FROM
    cicd_deployment_commits cdc
    JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
    and pm.`table` = 'cicd_scopes'
  WHERE
    pm.project_name in (${project})
    and cdc.result = 'SUCCESS'
    and cdc.environment = 'PRODUCTION'
  GROUP BY
    1, 3
  HAVING
    $__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
  SELECT
    d.deployment_id,
    d.deployment_finished_date,
    d.project_name,
    MAX(case when i.id is not null then 1 else 0 end) as has_incident
  FROM
    _deployments d
    left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
    left join incidents i on pim.id = i.id
  GROUP BY
    1, 2, 3
)
SELECT
  d.project_name,
  sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
  _failure_caused_by_deployments d
GROUP BY
  d.project_name

The only change is the line MAX(case when i.id is not null then 1 else 0 end) as has_incident. If this could solve your problem, feel free to put up a PR to address this. Don't forget to update the SQL of 'Change Failure Rate' in the DORA dashboard as well.

Startrekzky avatar Jul 08 '25 08:07 Startrekzky

@Startrekzky - This does seem to work.

jeffschaper avatar Jul 15 '25 20:07 jeffschaper

@jeffschaper Cool. I'll put up a PR.

Startrekzky avatar Jul 25 '25 09:07 Startrekzky

Hi @jeffschaper, I found that there's something wrong with the above SQL when I tried to put up the PR. Could you try this one out, please?

-- Metric 3: change failure rate
with _deployments as (
  -- When deploying multiple commits in one pipeline, GitLab and BitBucket may generate more than one deployment. However, DevLake consider these deployments as ONE production deployment and use the last one's finished_date as the finished date.
  SELECT
    cdc.cicd_deployment_id as deployment_id,
    max(cdc.finished_date) as deployment_finished_date
  FROM
    cicd_deployment_commits cdc
    JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
    and pm.`table` = 'cicd_scopes'
  WHERE
    pm.project_name in (${project})
    and cdc.result = 'SUCCESS'
    and cdc.environment = 'PRODUCTION'
  GROUP BY
    1
  HAVING
    $__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
  -- calculate the number of incidents caused by each deployment
  SELECT
    d.deployment_id,
    d.deployment_finished_date,
    -- count(distinct i.id) as has_incident
    count(distinct case when i.id is not null then i.id end) as has_incident
  FROM
    _deployments d
    left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
    left join incidents i on pim.id = i.id
  GROUP BY
    1,
    2
)
SELECT
  sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
  _failure_caused_by_deployments

Startrekzky avatar Aug 08 '25 02:08 Startrekzky

Hi @jeffschaper, how was the above SQL? Does it work?

Startrekzky avatar Sep 19 '25 08:09 Startrekzky

Hey @Startrekzky - I think we can close this issue. I'll reopen it or reference it in a new issue if needed. Thanks for your attention to this!

jeffschaper avatar Sep 22 '25 14:09 jeffschaper

@jeffschaper OK

Startrekzky avatar Sep 24 '25 18:09 Startrekzky