[Bug][Grafana] DORADetails-ChangeFailureRate Change Failure Rate Panel SQL incorrect calculation
Search before asking
- [x] I had searched in the issues and found no similar issues.
What happened
The "1. Change Failure Rate" panel on the "DORADetails-ChangeFailureRate" dashboard is displaying over 100%
What do you expect to happen
I'd expect the percentage to never exceed 100%
How to reproduce
Select more than one project from the DORA dashboard, then view the "1. Change Failure Rate" panel on the "DORADetails-ChangeFailureRate" dashboard
Anything else
We're noticing the bug with 13 projects selected from the project list. It's not clear to me if this is an aggregation issue with 13 projects, when the dashboard is expecting one or if the SQL query should take this into account. It appears something similar happened before in https://github.com/apache/incubator-devlake/issues/4610
Version
v1.0.1
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
The issue with the "Change Failure Rate" panel displaying a rate over 100% when multiple projects are selected is due to the SQL calculation not accounting for the distinct separation of deployments across different projects. The current query aggregates incidents and deployments without distinguishing between projects, leading to an inflated failure rate.
To correct this, the SQL query should be modified to calculate the change failure rate separately for each project and then aggregate the results. This can be achieved by including the project name in the GROUP BY clause and adjusting the calculation accordingly. Here is a revised version of the SQL query:
-- Metric 3: change failure rate
with _deployments as (
SELECT
cdc.cicd_deployment_id as deployment_id,
max(cdc.finished_date) as deployment_finished_date,
pm.project_name
FROM
cicd_deployment_commits cdc
JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
and pm.`table` = 'cicd_scopes'
WHERE
pm.project_name in (${project})
and cdc.result = 'SUCCESS'
and cdc.environment = 'PRODUCTION'
GROUP BY
1, 3
HAVING
$__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
SELECT
d.deployment_id,
d.deployment_finished_date,
d.project_name,
count(distinct case when i.id is not null then i.id end) as has_incident
FROM
_deployments d
left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
left join incidents i on pim.id = i.id
GROUP BY
1, 2, 3
)
SELECT
d.project_name,
sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
_failure_caused_by_deployments d
GROUP BY
d.project_name
This modification ensures that the change failure rate is calculated per project, preventing the rate from exceeding 100% when multiple projects are selected [1][2].
To continue talking to Dosu, mention @dosu.
Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other
@dosu - Can you tell me if this is truly a bug with the SQL query or if the dashboard SQL query is working as designed?
The issue with the "Change Failure Rate" panel displaying a rate over 100% when multiple projects are selected is indeed related to the SQL query. The current query aggregates incidents and deployments without distinguishing between projects, which can lead to an inflated failure rate when multiple projects are selected. This suggests that the query is not fully accounting for the distinct separation of deployments across different projects, which is why the rate can exceed 100%.
Therefore, it can be considered a bug in the SQL query design, as it does not handle the aggregation correctly when multiple projects are involved. The query should be adjusted to calculate the change failure rate separately for each project and then aggregate the results to ensure the rate does not exceed 100% when multiple projects are selected [1][2].
To continue talking to Dosu, mention @dosu.
Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other
@klesh - Curious if this issue has been discussed and/or in the queue to be worked on.
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.
This is still an issue.
@Startrekzky Would you like to take a look when you find time?
@jeffschaper I can't reproduce this bug on my local env.
Could you
- query the temp table
_failure_caused_by_deploymentsand check if there are records withhas_incident > 1? - If so, please change the query to below and try it out.
-- Metric 3: change failure rate
with _deployments as (
SELECT
cdc.cicd_deployment_id as deployment_id,
max(cdc.finished_date) as deployment_finished_date,
pm.project_name
FROM
cicd_deployment_commits cdc
JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
and pm.`table` = 'cicd_scopes'
WHERE
pm.project_name in (${project})
and cdc.result = 'SUCCESS'
and cdc.environment = 'PRODUCTION'
GROUP BY
1, 3
HAVING
$__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
SELECT
d.deployment_id,
d.deployment_finished_date,
d.project_name,
MAX(case when i.id is not null then 1 else 0 end) as has_incident
FROM
_deployments d
left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
left join incidents i on pim.id = i.id
GROUP BY
1, 2, 3
)
SELECT
d.project_name,
sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
_failure_caused_by_deployments d
GROUP BY
d.project_name
The only change is the line MAX(case when i.id is not null then 1 else 0 end) as has_incident. If this could solve your problem, feel free to put up a PR to address this. Don't forget to update the SQL of 'Change Failure Rate' in the DORA dashboard as well.
@Startrekzky - This does seem to work.
@jeffschaper Cool. I'll put up a PR.
Hi @jeffschaper, I found that there's something wrong with the above SQL when I tried to put up the PR. Could you try this one out, please?
-- Metric 3: change failure rate
with _deployments as (
-- When deploying multiple commits in one pipeline, GitLab and BitBucket may generate more than one deployment. However, DevLake consider these deployments as ONE production deployment and use the last one's finished_date as the finished date.
SELECT
cdc.cicd_deployment_id as deployment_id,
max(cdc.finished_date) as deployment_finished_date
FROM
cicd_deployment_commits cdc
JOIN project_mapping pm on cdc.cicd_scope_id = pm.row_id
and pm.`table` = 'cicd_scopes'
WHERE
pm.project_name in (${project})
and cdc.result = 'SUCCESS'
and cdc.environment = 'PRODUCTION'
GROUP BY
1
HAVING
$__timeFilter(max(cdc.finished_date))
),
_failure_caused_by_deployments as (
-- calculate the number of incidents caused by each deployment
SELECT
d.deployment_id,
d.deployment_finished_date,
-- count(distinct i.id) as has_incident
count(distinct case when i.id is not null then i.id end) as has_incident
FROM
_deployments d
left join project_incident_deployment_relationships pim on d.deployment_id = pim.deployment_id
left join incidents i on pim.id = i.id
GROUP BY
1,
2
)
SELECT
sum(has_incident) / count(deployment_id) as "change_failure_rate"
FROM
_failure_caused_by_deployments
Hi @jeffschaper, how was the above SQL? Does it work?
Hey @Startrekzky - I think we can close this issue. I'll reopen it or reference it in a new issue if needed. Thanks for your attention to this!
@jeffschaper OK