[Bug] Gaps in sum and avg aggregations when joining histogram quantile with pod labels
https://github.com/thanos-io/thanos/issues/2736#issuecomment-2171810584
Thanos, Prometheus and Golang version used:
thanos helm chart: 15.0.5
kube-prometheus-stack helm chart: 57.2.0
Chart.yaml and values.yaml for them: https://gist.github.com/velavokr/e8410555385db7bc9b1a1c184fe99b72
Object Storage Provider: s3
What happened: A problem similar to the one described here. After joining the metric with kube_pod_labels to filter by an additional label, the sum and avg aggregations started showing gaps. Other aggregations (max and min) still have no gaps.
How it happens:
- One pod is under a small but constant load (5 rps) and emits bucket counter metrics, which are collected by Prometheus. There is no gap in the aggregated metric.
- Spin up a second, similar pod and put much more load on it (100 rps), so its bucket counters grow 20 times faster.
- Stop applying load to the second pod. Its bucket counters stop growing. The first pod is still loaded with 5 rps.
- A new gap starts in the aggregated metric.
- Delete the second pod.
- The aggregated metric still has a gap for the period before the second pod was deleted, but no gap after that time.
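One possible explanation for the steps above, offered as an assumption and not confirmed anywhere in this report: once the second pod's bucket counters stop growing, the rate of every bucket is 0, and histogram_quantile returns NaN for a histogram with no observations. The Python sketch below is a simplified model of that behavior, not the Prometheus implementation. The function name mirrors the PromQL function, and the bucket bounds and rates are made-up illustration values.

```python
import math


def histogram_quantile(q, buckets):
    """Simplified model of PromQL histogram_quantile (illustration only).

    buckets: list of (upper_bound, cumulative_rate) pairs sorted by
    upper_bound, with the last upper bound equal to math.inf.
    """
    if not 0 < q <= 1:
        # Simplification: the real function also handles other q values.
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        # No observations in the window: the quantile is undefined.
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: use the highest finite bound.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second bucket rates for a pod that is still serving traffic.
loaded_pod = [(0.1, 2.0), (0.5, 4.0), (1.0, 5.0), (math.inf, 5.0)]
# Hypothetical rates for a pod whose counters stopped growing: all zero.
idle_pod = [(0.1, 0.0), (0.5, 0.0), (1.0, 0.0), (math.inf, 0.0)]

loaded_quantile = histogram_quantile(0.5, loaded_pod)
idle_quantile = histogram_quantile(0.5, idle_pod)
print(loaded_quantile, idle_quantile)
```

If this is what happens, the idle pod keeps contributing a NaN sample until it is deleted, which would match the gap ending at deletion time.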
The PromQL expression used:
avg(histogram_quantile(0.5, sum(rate(my_latency_bucket[$__rate_interval])) by (le,pod)) * on(pod) group_right(le) kube_pod_labels{my_pod_label="my_pod_label_value"})
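A hedged sketch of why avg and sum might behave differently from min and max in an expression like this one. It assumes, without confirmation from this report, that one pod's quantile evaluates to NaN (for example because its bucket rates are all 0). To my knowledge, PromQL min and max skip NaN samples when other values exist, while sum and avg propagate NaN, and dashboards typically draw NaN as a gap. The helper names below are hypothetical and only model that behavior in Python.

```python
import math


def agg_avg(values):
    # Arithmetic mean: a single NaN input makes the result NaN.
    return sum(values) / len(values)


def agg_min(values):
    # Models the assumed PromQL min behavior: NaN samples are skipped
    # unless every sample is NaN.
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan) if non_nan else math.nan


def drop_nan(values):
    # Removes NaN samples before aggregating.
    return [v for v in values if not math.isnan(v)]


# Hypothetical per-pod quantiles: one loaded pod, one idle pod yielding NaN.
per_pod = [0.2, math.nan]

avg_result = agg_avg(per_pod)                 # NaN, rendered as a gap
min_result = agg_min(per_pod)                 # 0.2, no gap
filtered_avg = agg_avg(drop_nan(per_pod))     # 0.2 once NaN is removed
print(avg_result, min_result, filtered_avg)
```

If the assumption holds, filtering NaN samples out before the outer aggregation (in PromQL, possibly with a comparison filter such as `>= 0` on the quantile, which I have not tested against this setup) may close the gaps.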
Here is the unaggregated metric. There are a few pods coming and going:
Here is the result of min. No gaps, as expected:
And here is the result of avg. Notice the gaps:
What you expected to happen: No gaps in aggregations if there are no gaps in the underlying metrics.
How to reproduce it (as minimally and precisely as possible): I'm unsure.
Full logs to relevant components: Cannot find the logs relevant to the expression evaluated.
Anything else we need to know: