[Bug] Gaps in sum and avg aggregations when joining histogram quantile with pod labels

Open velavokr opened this issue 8 months ago • 0 comments

https://github.com/thanos-io/thanos/issues/2736#issuecomment-2171810584

Thanos, Prometheus and Golang version used:
  thanos helm chart: 15.0.5
  kube-prometheus-stack helm chart: 57.2.0
  Chart.yaml and values.yaml for them: https://gist.github.com/velavokr/e8410555385db7bc9b1a1c184fe99b72

Object Storage Provider: s3

What happened: A problem similar to the one described here.

After I joined the metric with kube_pod_labels to filter by an additional label, the sum and avg aggregations started showing gaps. The other aggregations (max and min) still have no gaps.

How it happens:

  1. Have one pod under a small but constant load (5 rps) emitting bucket counter metrics, which Prometheus collects. The aggregated metric has no gaps.
  2. Spin up a second, similar pod and put much more load on it (100 rps), so its bucket counters grow 20 times faster.
  3. Stop applying load to the second pod. Its bucket counters stop growing. The first pod is still loaded at 5 rps.
  4. A gap starts in the aggregated metric.
  5. Delete the second pod.
  6. The aggregated metric still has the gap up to the time the second pod was deleted, but no gap after that time.
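One possible explanation for the timing in steps 3 to 6, offered as an assumption and not confirmed in this report: the Prometheus documentation states that histogram_quantile returns NaN when a histogram has 0 observations, and an idle pod that is still being scraped has a bucket rate of exactly zero. The Python sketch below is a simplified stand-in for that behavior, not the real Prometheus implementation; `simple_quantile` and the sample bucket values are hypothetical.

```python
import math


def simple_quantile(q, buckets):
    """Simplified stand-in for histogram_quantile (assumes 0 < q <= 1).

    buckets: list of (upper_bound, cumulative_rate) pairs sorted by
    upper_bound, where the last upper bound is math.inf.
    Returns NaN when the histogram saw no observations in the window.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # idle pod: every bucket rate is zero
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second bucket rates for the two pods after step 3.
loaded_pod = [(0.1, 3.0), (0.5, 5.0), (math.inf, 5.0)]  # still at 5 rps
idle_pod = [(0.1, 0.0), (0.5, 0.0), (math.inf, 0.0)]    # load stopped

loaded_q = simple_quantile(0.5, loaded_pod)
idle_q = simple_quantile(0.5, idle_pod)
print(loaded_q, idle_q)
```

Under this assumption the idle pod keeps contributing a NaN sample until it is deleted and its series disappear, which would match the gap ending at step 6.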

The PromQL expression used: avg(histogram_quantile(0.5, sum(rate(my_latency_bucket[$__rate_interval])) by (le,pod)) * on(pod) group_right(le) kube_pod_labels{my_pod_label="my_pod_label_value"})

Here is the unaggregated metric. There are a few pods coming and going: [Screenshot from 2024-06-16 21-50-06]

Here is the result of min. No gaps, as expected: [Screenshot from 2024-06-16 21-49-27]

And here is the result of avg. Notice the gaps: [Screenshot from 2024-06-16 21-49-43]
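The difference between min and avg would be consistent with NaN handling in the aggregations. This is a hypothesis, not something verified against this setup. As far as I understand the Prometheus engine, sum and avg propagate a NaN input, while min and max prefer any real value over NaN. The Python sketch below mimics that assumed behavior with hypothetical helper names and sample values.

```python
import math


def agg_sum(values):
    # Ordinary float addition: a single NaN makes the whole sum NaN.
    return sum(values)


def agg_avg(values):
    # The average is built on the sum, so it inherits the NaN.
    return sum(values) / len(values)


def agg_min(values):
    # Assumed Prometheus-like behavior: ignore NaN if a real value exists.
    real = [v for v in values if not math.isnan(v)]
    return min(real) if real else math.nan


def agg_max(values):
    real = [v for v in values if not math.isnan(v)]
    return max(real) if real else math.nan


# Hypothetical per-pod quantiles: one loaded pod, one idle pod (NaN).
per_pod = [0.083, math.nan]

print("sum:", agg_sum(per_pod))
print("avg:", agg_avg(per_pod))
print("min:", agg_min(per_pod))
print("max:", agg_max(per_pod))
```

If this is the cause, dropping NaN samples before aggregating (for example by comparing the quantile result with `>= 0`, which filters out NaN) may close the gaps. That workaround is untested here.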

What you expected to happen: No gaps in aggregations if there are no gaps in the underlying metrics.

How to reproduce it (as minimally and precisely as possible): I'm unsure.

Full logs to relevant components: I could not find any logs relevant to the evaluated expression.

Anything else we need to know:

velavokr, Jun 17 '24 04:06