prometheus-engine
Exported aggregated metrics via recording rules have the wrong type
When exported metrics are generated by a simple recording rule that aggregates away some labels, like:

```yaml
- record: "workload:istio_requests_total"
  expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)"
```
The resulting metric in GCP Monitoring ends up with a gauge type, but it should be a counter. This seems to apply to any metric produced by a recording rule. Version: v2.35.0-gmp.5-gke.0.
Thanks for the report and great observation!
We intentionally save all rules as a gauge, which can be confusing.
Are there any functional reasons you prefer a counter?
It was pointed out to me by @pintohutch (thanks!) that Prometheus does not internally store type for recording rules, so it is not possible for GMP to deduce them:
> The Prometheus server does not yet make use of the type information and flattens all data into untyped time series. This may change in the future.
We actually double-write other unknown metrics as both gauges and counters, but we chose not to do that here because it comes with a cost (higher bills 💸 😱).
Let me know if you have any use-cases!
> Thanks for the report and great observation!
> We intentionally save all rules as a gauge, which can be confusing.
> Are there any functional reasons you prefer a counter?
The main use case is that recording rules are used to aggregate all kinds of metric types: counters, gauges, etc. Certain metrics are better suited to a counter type, like a count of HTTP requests. After aggregation via rules they become gauges, which changes how the data behaves: a query like `rate(http_requests[5m])`, which computes the request rate from a counter, breaks when the metric becomes a gauge.
Counters should not be summed before `rate()` is applied. For example, see this article: https://www.robustperception.io/rate-then-sum-never-sum-then-rate/
This is also explained in the documentation for `rate()`, which assumes its input is a counter and not a gauge, so you would have the same issue even if the metric type were a counter.
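To make the ordering concrete, here is a hedged sketch (the metric and rule names are illustrative, not from this thread):

```yaml
# Broken: sums the raw counters first. If any source series resets
# (e.g. a pod restarts), the summed value drops, and rate() over the
# recorded series misinterprets that drop as a counter reset.
- record: "job:http_requests_total"
  expr: "sum without(instance) (http_requests_total)"

# Correct: rate() handles each series' counter resets individually,
# then the per-second rates are summed.
- record: "job:http_requests:rate5m"
  expr: "sum without(instance) (rate(http_requests_total[5m]))"
```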
Ah, by aggregation I meant the act of removing labels to reduce series cardinality, rather than aggregation at query time. A couple of rules for example:
```yaml
- record: "workload:istio_requests_total"
  expr: |
    sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)
- record: "workload:istio_request_duration_milliseconds_count"
  expr: |
    sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)
- record: "workload:istio_request_duration_milliseconds_sum"
  expr: |
    sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)
```
They strip unnecessary labels from the metrics while retaining their behavior as counters, so querying `istio_requests_total` or `workload:istio_requests_total` locally in Prometheus gives the same results. Querying `workload:istio_requests_total` in GMP, however, gives odd results because it is treated as a gauge.
Hi 👋🏽
Unfortunately, all of the recording rules you show produce broken counters, as explained at https://www.robustperception.io/rate-then-sum-never-sum-then-rate/. This is because you use an aggregation (the `sum` operation) over series that become duplicates after label removal (which is your point: you want to reduce the number of unique series). The duplicated series are summed, and if they started at different times (the likely case), the result is an incorrect counter.
The only valid rule here is `sum without(...) (rate(...[<the window you want the rate over, typically 5m>]))`. There is also an argument that recording rules are not ideal for reducing cardinality, because depending on where you run them (locally vs. globally), the data has already been ingested.
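A rough sketch of such a rule for the metrics above (the `5m` window and the rule name are just examples, not a recommendation from this thread):

```yaml
- record: "workload:istio_requests:rate5m"
  expr: |
    sum without(instance, kubernetes_namespace, kubernetes_pod_name)
      (rate(istio_requests_total[5m]))
```

Note the recorded series is now a per-second rate (a gauge by nature), so it is queried directly rather than wrapped in another `rate()`.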
We have to work with the community to improve client-side aggregation techniques, but currently the best approach is to adjust/reconfigure the source of the metrics (the instrumentation).
Yeah, the rate calculation makes sense. Here's the full context of how I came to notice this:
Initially I was trying to use federation, with local Prometheus instances holding the high-cardinality data and a central GMP Prometheus pulling in aggregated metrics. But federation isn't supported by GMP, so I tried local aggregation in the GMP Prometheus instead, and that results in those metrics being gauges. I'll probably just have to look into configuration at the metrics source then.