pyrra icon indicating copy to clipboard operation
pyrra copied to clipboard

Nested recording rule generation for different SLO time spans?

Open GrgDev opened this issue 1 year ago • 1 comments

I have a use case where we have a ton of metrics data. We want to do SLOs up to a 4 week period, but having those queries regularly query 4 weeks worth of raw data is very compute heavy. It would be nice if the generated recording rules for longer time span SLO queries were based on shorter term recording rule time spans. For example, making the generated 4 week SLO recording rule based on the 1 day SLO recording rule instead of the raw data.

Is this possible with the current implementation of Pyrra?

GrgDev avatar Aug 22 '24 17:08 GrgDev

This sounds like Thanos downsampling might be worth looking into. Right now, Pyrra's recording rules take into account all of the 4w of data. That's how the recording rule works. I think there was a discussion at some point to split the 4W queries up, but I can't find it right now, and I don't think it went anywhere.

metalmatze avatar Aug 23 '24 15:08 metalmatze

I wanted to update here with a finding about a workaround I found to get it to work the way I wanted to in this request. This might have been what metalmatze had meant in the first place with Thanos downsampling, but I didn't understand that if that was the case. It's not quite the exact same as my feature request, but it is nested recording rules and it did fix the high cardinality compute problem. In case it makes any difference, I am using this with Mimir instead of Prometheus.

I did some quick testing to see if the generated SLO recording rules generally had the same data as the SLO recording rules that was using the raw data based metric. So far what I found was that I can get the data to be acceptably similar enough if the base recording rule followed a couple of rules.

  1. The underlying metric for the base recording rule cannot use a rate() function as that would cause some generated recording rules to switch to taking a rate of a rate unintentionally.
  2. For the main use cases I have looked at so far, you would most likely want your base recording rule to use a sum() aggregate function.
  3. If the underlying metric for the base recording rule uses a selector label that matches on a regex instead of an exact string match, the underlying metric has to preserve that label in the resulting recording rule metric series, such as adding a by clause in your aggregation function such as sum by (label_name).

Here's an example. I have a metric called failures_total and another metric requests_total. Let's say these metrics have very high cardinality, so using them directly for the Pyrra generated increase4w recording rule puts unacceptable stress on prometheus/mimir. To get around this, we can define recording rules like this. Then the regex selector(s) need to be applied again on the recording rule in the Pyrra SLO definition metric.

groups:
  - name: example
    rules:
    - record: my_service:failures_total:sum
      expr: sum(failures_total{app="my_service", status_code=~"5.."}) by (status_code)
    - record: my_service:requests_total:sum
      expr: sum(requests_total{app="my_service"}) by (status_code)

Note how I followed my rules from above.

  • I did not use a rate() function.
  • I did use a sum() function.
  • The sum() function has a by (status_code) clause instead of by (app, status_code) because only the status_code label is matching by regex.

Then when I go to define my SLOs for Pyrra, I do:

  target: "99"
  window: 4w
  description: The request error rate for my service.
  indicator:
    ratio:
      errors:
        metric: my_service:failures_total:sum{status_code=~"5.."}
      total:
        metric: my_service:requests_total:sum

In my testing, this has so far worked for me. I don't have 4 weeks worth of data in the recording rules yet so we'll see for sure once I get there, but so far the data seems the same.

I do not know for sure if the by clause being needed for labels with regex match operators is truly required. I just know that when I defined SLOs with such a regex match operator before on the raw metrics, Pyrra generated recording rules added that by clause in them. I was mostly concerned with ensuring that I got the same behavior and data from Pyrra that I did before. It's possible that it gets summed up in the end and therefore doesn't really matter.

GrgDev avatar Apr 21 '25 16:04 GrgDev

Thanks for the detailed write-up. If I understand your proposal correctly, I unfortunately think we cannot do that. Doing the my_service:requests_total:sum recording rule as sum(requests_total{app="my_service"}) by (status_code) then we'll end up in the situation where we sum(rate(sum(metric[5m]))) in the UI. The UI is going to sum by (code) (rate(my_service:requests_total:sum[5m])) > 0 which is incorrect.

This is a bug in the Grafana dashboards with the generic rules we've long needed to fix.

metalmatze avatar May 24 '25 15:05 metalmatze

I'll close this for now. We can discuss it some more but unless we have a good solution for this, we'll keep it closed.

metalmatze avatar May 24 '25 15:05 metalmatze