fix: change query for uptime stat panel
What this PR does / why we need it:
This PR fixes the query that is used to calculate the uptime stat panel for each check. As described in this escalation, this panel was displaying a value of 100% even when the probes had 100% errors reaching the target.
Which issue(s) this PR fixes:
Part of https://github.com/grafana/support-escalations/issues/11197. This solution addresses the first problem described here https://github.com/grafana/support-escalations/issues/11197#issuecomment-2195657543
@mem I changed the query to match the solution in your comment.
I'm sorry, I just realized this is the query we had to fix for a different reason.
I think the query itself is correct. We had to add a transformation that collects the data for the entire range because Mimir has a limit on the amount of data that it's willing to retrieve.
I was confused because the "explore" link in the panel drops that transformation.
Let me take another look.
https://github.com/grafana/support-escalations/issues/11197#issuecomment-2218047905 -- update as of 2024/07/09
I added a new uptime query version from @mem's proposal:
floor(
  # Report a 1 if there's a location where most observations were successful and 0 if most observations failed for all probes.
  max by (instance, job) (
    round(
      # the number of successes for each probe
      (increase(probe_all_success_sum{instance="$instance", job="$job"}[$__rate_interval]))
      /
      # the total number of times we checked for each probe
      ((increase(probe_all_success_count{instance="$instance", job="$job"}[$__rate_interval])))
    )
  )
)
It is under a feature flag named uptime-query-v2. By default, we're using the original query.
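For reviewers, here is a minimal sketch of the arithmetic the v2 query performs for a single interval, written in plain Python. It is not part of this PR: the function name `uptime_v2` and the input shape are hypothetical, and skipping probes with no observations is an assumption made for the sketch (in PromQL a 0/0 division produces NaN instead).

```python
import math


def uptime_v2(per_probe):
    """Mimic floor(max(round(successes / total))) for one interval.

    per_probe maps a probe name to (successes, total), i.e. the increase of
    probe_all_success_sum and probe_all_success_count over the interval.
    """
    rounded = []
    for successes, total in per_probe.values():
        if total == 0:
            # Assumption for this sketch: ignore probes with no observations.
            continue
        ratio = successes / total
        # PromQL round() resolves ties by rounding up, unlike Python's
        # round(), so emulate it with floor(x + 0.5).
        rounded.append(math.floor(ratio + 0.5))
    if not rounded:
        return None  # no data for this interval
    # max by (instance, job): one mostly-successful probe is enough for a 1.
    return math.floor(max(rounded))
```

With this, an interval where every probe fails yields 0, and an interval where at least one probe mostly succeeds yields 1, which matches the comment inside the query.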