
fix: change query for uptime stat panel

Open VikaCep opened this issue 1 year ago • 2 comments

What this PR does / why we need it:

This PR fixes the query used to calculate the value shown in the uptime stat panel for each check. As described in this escalation, the panel was displaying 100% uptime even when the probes had 100% errors reaching the target.

Which issue(s) this PR fixes:

Part of https://github.com/grafana/support-escalations/issues/11197. This solution addresses the first problem described here: https://github.com/grafana/support-escalations/issues/11197#issuecomment-2195657543

VikaCep avatar Jun 27 '24 21:06 VikaCep

@mem I changed the query according to your comment's solution.

VikaCep avatar Jun 27 '24 21:06 VikaCep

I'm sorry, I just realized this is the query we had to fix for a different reason.

I think the query itself is correct. We had to add a transformation that collects the data for the entire range because Mimir has a limit on the amount of data that it's willing to retrieve.

I was confused because the "explore" link in the panel drops that transformation.

Let me take another look.

mem avatar Jun 27 '24 21:06 mem

https://github.com/grafana/support-escalations/issues/11197#issuecomment-2218047905 -- update as of 2024/07/09

ckbedwell avatar Jul 17 '24 13:07 ckbedwell

I added a new version of the uptime query, based on @mem's proposal:

floor(
  # Report a 1 if there's a location where most observations were successful and 0 if most observations failed for all probes.
  max by (instance, job) (
    round(
      # the number of successes for each probe
      increase(probe_all_success_sum{instance="$instance", job="$job"}[$__rate_interval])
      /
      # the total number of times we checked for each probe
      increase(probe_all_success_count{instance="$instance", job="$job"}[$__rate_interval])
    )
  )
)

It is under a feature flag named uptime-query-v2. By default, we're using the original query.
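
As a rough illustration of what the v2 query evaluates at a single point in time, here is a minimal Python sketch. It is not code from the app: the function names, probe names, and sample counts are hypothetical, and it assumes the series aggregated away by `max by (instance, job)` differ only by probe. It also assumes probes with no observations in the window are simply skipped, which only approximates how a division by zero would surface in PromQL.

```python
import math
from typing import Dict, Optional, Tuple


def promql_round(value: float) -> int:
    """Round to the nearest integer with ties rounded up, like PromQL's round()."""
    return math.floor(value + 0.5)


def uptime_sample(per_probe: Dict[str, Tuple[float, float]]) -> Optional[int]:
    """Return 1 if any probe saw mostly successes in the window, else 0.

    per_probe maps a (hypothetical) probe name to a pair:
    (increase of probe_all_success_sum, increase of probe_all_success_count).
    """
    rounded = []
    for successes, total in per_probe.values():
        if total == 0:
            # Assumption: a probe with no observations contributes nothing.
            continue
        # round(successes / total): 1 if at least half succeeded, else 0.
        rounded.append(promql_round(successes / total))
    if not rounded:
        return None  # no data for this window
    # max by (instance, job) picks the best probe; floor() keeps it an integer.
    return math.floor(max(rounded))


# Every probe fails to reach the target: the sample counts as down.
all_failing = uptime_sample({"probe_a": (0, 10), "probe_b": (0, 10)})

# One probe mostly succeeds: the sample counts as up.
one_healthy = uptime_sample({"probe_a": (7, 10), "probe_b": (0, 10)})
```

With these inputs, `all_failing` is 0 and `one_healthy` is 1, so a window in which every probe has 100% errors no longer counts toward uptime. A probe with exactly half of its checks succeeding rounds up to 1.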

VikaCep avatar Aug 23 '24 17:08 VikaCep