mimir Instant query time splitting can cause different query results using rate/delta/increase on stale metrics

Open pracucci opened this issue 2 years ago • 3 comments

While developing instant query time splitting we found that it can cause different query results (compared to a query running without splitting) when using rate() / increase() / delta() and the metric is stale. Instant query time splitting is currently under development and disabled by default.

A split rate() can return a different result because of the extrapolation done by the PromQL engine: https://github.com/prometheus/prometheus/blob/be330ac0356fb447613b29501ca4e882211a0b78/promql/functions.go#L59

Example

For example, let's consider a counter series with the following samples:

T: 0 V: 1
T: 30 V: 2
T: 60 V: 3
No more samples from T 60 through T 120

Running the query without splitting

Running the instant query rate(metric[2m]) at time 120 returns the result 0.020833333333333332, which is computed as follows:

Sample the interval [0, 60] (first and last sample), computing a value increase of value = 3 - 1 = 2
The query range start timestamp (0) matches the timestamp of the first sample, but the query range end timestamp (120) is later than the timestamp of the last sample (60) by more than 110% of the average interval between samples, so the extrapolation triggers:
a. Compute the extrapolated interval as sampled interval + (average interval between samples / 2) = 60 + (30 / 2) = 75
b. Compute the updated value adding the extrapolation as value * (extrapolated interval / sampled interval) = 2 * (75 / 60) = 2.5
Compute the final rate as value / query interval = 2.5 / 120 = 0.020833333333333332
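The steps above can be sketched in Python. This is a simplified model of the extrapolation, not the Prometheus implementation: it assumes timestamps in seconds, a range that includes both boundaries (as in the example, where the sample at T 0 is counted), and it ignores counter resets and the clamp that stops a counter from being extrapolated below zero. The function name extrapolated_increase is hypothetical.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Simplified sketch of the PromQL increase() extrapolation.

    samples: list of (timestamp_seconds, value) sorted by timestamp.
    Assumptions: closed range, no counter resets, no zero clamp.
    """
    in_range = [(t, v) for t, v in samples if range_start <= t <= range_end]
    if len(in_range) < 2:
        return None  # not enough samples: the result is empty

    first_t, first_v = in_range[0]
    last_t, last_v = in_range[-1]
    increase = last_v - first_v
    sampled_interval = last_t - first_t
    avg_interval = sampled_interval / (len(in_range) - 1)
    threshold = avg_interval * 1.1

    extrapolated_interval = sampled_interval
    # Gap between the range start and the first sample, then between
    # the last sample and the range end.
    for gap in (first_t - range_start, range_end - last_t):
        if gap < threshold:
            extrapolated_interval += gap  # extrapolate up to the boundary
        else:
            extrapolated_interval += avg_interval / 2  # half an interval only

    return increase * (extrapolated_interval / sampled_interval)


samples = [(0, 1.0), (30, 2.0), (60, 3.0)]
unsplit_increase = extrapolated_increase(samples, 0, 120)
unsplit_rate = unsplit_increase / 120
print(unsplit_increase, unsplit_rate)
```

With the samples from the example, the extrapolated interval is 60 + 0 + 15 = 75, which gives the increase of 2.5 described above.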

Running the query with splitting

Running the same query with a split interval of 1m would result in the following query:

sum without() (
    concat(
        increase(metric[1m] offset 1m)
        increase(metric[1m])
    )
) / 120

To get exactly the same result as the query without splitting, we would expect the following partial query results:

increase(metric[1m] offset 1m) result 2.5
increase(metric[1m]) result empty (no samples matching this time range)

However, increase(metric[1m] offset 1m) returns 2, so the full query result is 2 / 120 = 0.016666666666666666, which differs from the expected 0.020833333333333332.

Why? To understand, let's follow the increase() algorithm (which is the same as rate() with regard to extrapolation) for the query increase(metric[1m] offset 1m) run at time 120:

Sample the interval [0, 60] (first and last sample) computing a value increase of value = 3 - 1 = 2
The query range start timestamp (0) matches the timestamp of the first sample, and the query range end timestamp (60) matches the timestamp of the last sample, so no extrapolation triggers
The final increase value is 2
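To make the difference concrete, here is a hedged Python sketch that evaluates the two partial ranges with a simplified model of the extrapolation. It is not the Mimir or Prometheus code: the helper names are hypothetical, ranges are treated as closed on both sides, and counter resets and the zero clamp are ignored.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Simplified increase() model (closed range, no counter resets)."""
    in_range = [(t, v) for t, v in samples if range_start <= t <= range_end]
    if len(in_range) < 2:
        return None  # empty result
    (first_t, first_v), (last_t, last_v) = in_range[0], in_range[-1]
    sampled_interval = last_t - first_t
    avg_interval = sampled_interval / (len(in_range) - 1)
    threshold = avg_interval * 1.1
    extrapolated_interval = sampled_interval
    for gap in (first_t - range_start, range_end - last_t):
        extrapolated_interval += gap if gap < threshold else avg_interval / 2
    return (last_v - first_v) * (extrapolated_interval / sampled_interval)


samples = [(0, 1.0), (30, 2.0), (60, 3.0)]

# Partial queries produced by a 1m split interval, evaluated at time 120.
partials = [
    extrapolated_increase(samples, 0, 60),    # increase(metric[1m] offset 1m)
    extrapolated_increase(samples, 60, 120),  # increase(metric[1m])
]

# Sum the non-empty partial results, then divide by the full 2m range.
split_increase = sum(p for p in partials if p is not None)
split_rate = split_increase / 120

unsplit_rate = extrapolated_increase(samples, 0, 120) / 120
print(partials, split_rate, unsplit_rate)
```

In this model the first partial range sees no gap at either boundary, so nothing is extrapolated, and the second partial range has too few samples to produce a result. The half-interval extrapolation that the unsplit query applies after T 60 is therefore lost.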

Jul 29 '22 11:07 pracucci

For reference, the current extrapolation logic was added in this commit: https://github.com/prometheus/prometheus/commit/c77c3a8c56cf72ffca212bd3d34c87cbf8bc772f

Jul 29 '22 12:07 pracucci

An interesting discussion on extrapolation: https://github.com/prometheus/prometheus/issues/3746

Jul 29 '22 12:07 pracucci

To understand why it extrapolates by 50% of the scrape interval when there is no sample within 110% of the average sample interval from the range start/end, see this video starting at minute 17:25:

[...] if it's not, it means we think the series has started or stopped, we'll just extrapolate 50% of the interval. For instance, if the time series disappeared, it could have disappeared any time between the last scrape and when the next scrape should have been, so on average it disappears on the middle.

Jul 29 '22 12:07 pracucci
