
Add `promql/gauge_staleness` check

Open · wbollock opened this issue on Apr 29, 2024 · 9 comments

We frequently nitpick users about using query functions when alerting on gauges:

https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/

I believe this would be a nice addition to Pint and to Prometheus rule linting in general. Based on the counter check, it seems like we could similarly detect when a metric is a gauge and isn't covered by a query function: https://github.com/cloudflare/pint/blob/63fb30627bd644f2c3ed86e073bcae98badf6a56/internal/checks/promql_counter.go#L51
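
To make that concrete, here's a rough sketch of the kind of rules such a check might differentiate (rule names and thresholds are only placeholders, loosely following the linked article):

# would be flagged: bare gauge, only the most recent sample is considered
- alert: JobDown
  expr: up{job="jobname"} < 1

# would pass: the gauge is covered by an _over_time query function over a range
- alert: JobDown
  expr: avg_over_time(up{job="jobname"}[5m]) < 0.9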

I'm happy to contribute this back to Pint, but I want to discuss with the maintainer whether it would be acceptable first.

cc: @prymitive

wbollock avatar Apr 29 '24 19:04 wbollock

I don't think I follow. Can you give a more concrete example? By "we frequently nitpick users about using query functions when alerting on gauges", do you mean that when someone uses up{job="jobname"} < 1 you tell them to use avg_over_time(up{job="jobname"}[5m]) < 0.9 instead?

prymitive avatar Jun 04 '24 09:06 prymitive

Yes that's a perfect example, exactly the scenario I have in mind. Let me know if this seems like a valid check :smile:

wbollock avatar Jun 04 '24 12:06 wbollock

But there is nothing wrong with up{job="jobname"} < 1, it's a perfectly valid alert query. If someone wants to write an alert for a single scrape failure, that's the way to do it. avg_over_time(up{job="jobname"}[5m]) < 0.9 is something one can do instead, if they want to alert on a number of failures rather than a single one, but it's not like one is valid and the other one isn't. It all depends on what the user is trying to accomplish, which pint doesn't know, only the user does. I don't think that adding checks that just give you an alternative way of doing something that might or might not be what you want is valuable. It just teaches pint users that if they see a warning or error they need to either ignore it or disable that check.

prymitive avatar Jun 04 '24 12:06 prymitive

Both queries are valid, but in my experience the author of up{job="jobname"} < 1 is never intentionally using it to alert if there is any staleness in addition to the scrape job being down. It's a common pitfall with the staleness handling changes outlined in the article, https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/.

A bare up is prone to the staleness handling changes and simply weaker than if max_over_time were used to alert on any single scrape failure. It's not just an alternate way of doing things, but an improvement with no downsides at all:

"This way gaps in our data resulting from failed scrapes and flaky time series will no longer result in false positives or negatives in our alerting."

For example, up{job="jobname"} < 1 vs max_over_time(up{job="jobname"}[5m]) < 1. I envision the check as asking for any query function to cover the gauge, not specifically avg_over_time.
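
Written out as rules (names are placeholders and the 5m range is just an example window), the two variants would be:

# looks only at the most recent up sample at evaluation time
- alert: JobDown
  expr: up{job="jobname"} < 1

# looks at every sample from the last 5 minutes and fires only if none of them was 1
- alert: JobDown
  expr: max_over_time(up{job="jobname"}[5m]) < 1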

wbollock avatar Jun 04 '24 13:06 wbollock

up isn't the best example since it's a special metric that is always present and doesn't need an _over_time function to be resilient.

A better example might be the exporter pattern of {service}_up metrics like mysql_up from the mysqld_exporter. I've seen multiple cases across multiple exporters (and even applications natively exposing metrics, like Ceph) where there are silent failures: up stays at 1 but the target stops returning all of its time series, or only returns them intermittently.

In such a case, an alert for up < 1 wouldn't fire, and neither would one for ceph_osd_up < 1 (if it has a non-zero for duration, since the for timer would be reset frequently and the alert would either flap a lot or never fire). Adding an _over_time function such as last_over_time would make the alert less fragile.

I don't even think that this is specific to gauges. An alert for a Summary metric could be fragile in the same way: prometheus_engine_query_duration_seconds{quantile="0.99"} > 1.

Histograms may be different since they should probably fall under the same existing promql/counter check 🤔.

Regardless, I think it would be useful to have a check similar to promql/fragile (or extend that?) to detect when an alert has a non-zero for duration AND isn't using a range vector selector. Detecting a range vector selector rather than an _over_time function should be more flexible and easier to implement, I think.
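
As a sketch of what that check could flag (metric name and window are purely illustrative), the fragile form next to one using a range vector selector:

# would be flagged: non-zero `for` duration with no range vector selector;
# a single failed scrape marks the series stale and resets the `for` timer
- alert: CephOSDDown
  expr: ceph_osd_up < 1
  for: 5m

# would pass: last_over_time keeps returning the last sample seen within 5m,
# so a missed scrape doesn't blank the query and reset the timer
- alert: CephOSDDown
  expr: last_over_time(ceph_osd_up[5m]) < 1
  for: 5m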

Excerpt from Prometheus: Up & Running (2nd edition): [image]

wbh1 avatar Jun 10 '24 16:06 wbh1

@prymitive when you get a chance, do you mind revisiting this discussion? It's still a frequent issue with our Prometheus rule submitters :)

wbollock avatar Sep 19 '25 15:09 wbollock

I don't really understand what problem pint is supposed to find in these queries. You do NOT need to query all gauges via *_over_time(); you can simply query for the last value. That's perfectly valid and will work perfectly fine. And if you do that, staleness markers are NOT your enemy and don't force you to complicate your queries. You only need *_over_time() functions if you care about the value over some time range, rather than the last value. And, very often, even if you do care about the value over some time, like when checking if a target is down, simply writing:

- alert: Foo_Down
  expr: up{job="foo"} == 0
  for: 5m

is perfectly fine, simple, expressive, and works as expected. So either you're talking about some niche case here or I'm missing something obvious.

prymitive avatar Sep 23 '25 09:09 prymitive

This very well may be a niche case. I agree that querying for just the last value is valid and works fine in most cases. This is also not specific to gauges — it's any metric exposed by a metric producer, but counters are more likely to be wrapped in a rate, increase, etc., and avoid this.

If a target fails any scrape and then subsequently succeeds, it resets the for duration of the alert.[^1]

Let's say your Prometheus is crash-looping. It dies a few seconds into WAL replay repeatedly, but is usually up for long enough to get scraped by another Prometheus. And say you have these alerts on that other Prometheus monitoring it:

- alert: PrometheusNotReady
  expr: prometheus_ready == 0
  for: 5m

- alert: PrometheusScrapeDown
  expr: up{job="prometheus"} == 0
  for: 5m

The alert for up is never going to fire because the scrape job is flapping up and down. However, PrometheusNotReady is also not going to fire, because its for timer gets reset every time a scrape fails.

Example gist of a promtool unit test for the alerting rules: https://gist.github.com/wbh1/0d29dea0c1b637325c4ca84a4bfb08e6
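
For reference, a minimal sketch of what such a test could look like (the file name, labels, and sample timings here are illustrative, not copied from the gist), assuming the two rules above live in alerts.yml:

rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # crash-looping Prometheus: most scrapes return prometheus_ready=0,
      # but every third scrape fails and writes a staleness marker
      - series: 'prometheus_ready{job="prometheus", instance="prom:9090"}'
        values: '0 0 stale 0 0 stale 0 0 stale 0'
    alert_rule_test:
      - eval_time: 9m
        alertname: PrometheusNotReady
        # never fires: each staleness marker makes the query return nothing,
        # which resets the `for: 5m` timer
        exp_alerts: []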

[^1]: It's also dependent on your scrape interval vs your rule evaluation interval. For example, if the target is scraped every 30s, the rule interval is 60s, and the scrapes only fail every-other-time, then the staleness markers are removed by the time the query evaluates and your approach works fine.

wbh1 avatar Sep 24 '25 19:09 wbh1

Sure, but what problem is pint supposed to find here? What is the error message pint should produce here, and why?

IMHO everything here works as expected. This rule:

- alert: PrometheusNotReady
  expr: prometheus_ready == 0
  for: 5m

means: alert me if prometheus_ready is present with value 0 for every evaluation in the last 5 minutes. Same for up. These are both valid alerts.

You might want to be alerted when prometheus is in a crash loop, and so you could write an alert for that. But it's not pint's job to enumerate all possible alerts you could write.

prymitive avatar Sep 25 '25 13:09 prymitive