Results cache should be invalidated by receipt of older data
Describe the bug
Currently in the query-frontend we start caching results once they are 10 minutes old; this is the "cache freshness" setting, and it exists to allow for new data arriving in that period. However, we allow data to arrive any time up to the head boundary, which is 1-2 hours by default.
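For illustration, here is a rough sketch of the mismatch (not Mimir's actual code; the durations are the defaults mentioned above): the frontend only refuses to cache the last "cache freshness" worth of results, while ingesters keep accepting in-order samples anywhere within the head boundary.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxCacheFreshness = 10 * time.Minute // the "cache freshness" setting mentioned above
	headBoundary      = 2 * time.Hour    // in-order samples can still arrive this far back
)

// cacheableUpTo returns the newest result timestamp the frontend is willing to cache.
func cacheableUpTo(now time.Time) time.Time {
	return now.Add(-maxCacheFreshness)
}

func main() {
	now := time.Now()
	cutoff := cacheableUpTo(now)

	// Any sample with a timestamp between now-headBoundary and the cutoff can
	// still be ingested after a query ran, yet the cached extent covering that
	// range is never invalidated.
	fmt.Printf("results newer than %v are not cached\n", cutoff.Format(time.RFC3339))
	fmt.Printf("cached results can silently miss up to %v of late-arriving data\n", headBoundary-maxCacheFreshness)
}
```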
To Reproduce
(Roughly; a query sketch follows the steps below):
- Submit a query so the result is cached.
- Ingest some new samples where the timestamp is >10 minutes ago
- Submit the exact same query again.
- Observe that it does not include the new samples.
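As a rough sketch of the query half of this reproduction (the base URL, port, tenant header value, and metric are assumptions about a local Mimir setup, not part of the report): run this once, ingest samples with timestamps more than 10 minutes old, then run it again and diff the outputs.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// rangeQuery issues a Prometheus-compatible range query against Mimir and
// returns the raw JSON response.
func rangeQuery(base, query string, start, end time.Time) (string, error) {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", "60")

	req, err := http.NewRequest(http.MethodGet, base+"/prometheus/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-Scope-OrgID", "demo") // assumed tenant ID

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	end := time.Now()
	start := end.Add(-1 * time.Hour)

	// Run once so the result gets cached, ingest late samples, then run again:
	// the second response should (but currently may not) include them.
	out, err := rangeQuery("http://localhost:8080", "up", start, end)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```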
Expected behavior
Queries should return all data (maybe after some short period to allow things to settle).
Additional Context
The new experimental out-of-order feature makes it more likely that people will send old samples.
A complicated fix would be to send invalidation messages from ingesters to the query-frontend when older data arrives.
A simpler fix would be to expire recent cache results after a few minutes. We would do more work re-evaluating queries where nothing had changed.
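As a sketch of what the simpler fix could look like (the type, field names, and durations here are illustrative, not Mimir's actual cache code): cached extents that still overlap the window where late samples may arrive would get a short TTL.

```go
package resultscache

import "time"

// Extent stands in for a cached partial query result.
type Extent struct {
	End      time.Time // newest sample timestamp covered by the cached result
	CachedAt time.Time // when the extent was written to the cache
}

const (
	recentWindow = 2 * time.Hour      // roughly the head boundary / OOO window
	recentTTL    = 5 * time.Minute    // expire "recent" extents after a few minutes
	defaultTTL   = 7 * 24 * time.Hour // long TTL for data that can no longer change
)

// usable reports whether a cached extent may still be served. Extents whose
// data could still receive late samples expire quickly, forcing a re-evaluation.
func usable(e Extent, now time.Time) bool {
	if now.Sub(e.End) < recentWindow {
		return now.Sub(e.CachedAt) < recentTTL
	}
	return now.Sub(e.CachedAt) < defaultTTL
}
```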
Speaking specifically with out-of-order support in mind, since it is already out in the wild, the simpler fix seems attractive as a start while we also look at the complicated fix. It seems better than setting cache-freshness to match the out-of-order time window :)
> A simpler fix would be to expire recent cache results after a few minutes. We would do more work re-evaluating queries where nothing had changed.
I'm worried this might lead to query inconsistencies. Even if we set lower TTLs in scenarios where out-of-order is enabled, if a query comes before and after an out-of-order write, the later query would not reflect the new data because the TTL has not expired yet.
For that reason I think we should simply not cache anything within the out-of-order time window, keeping TSDB's read-after-write guarantee.
Query caching already doesn't cache anything within the last 10 minutes, so this can be the window we recommend for out-of-order. In the event that the customer needs a bigger window, our recommendation would be to increase the non-caching time frame to match the out-of-order window.
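In other words, the non-cached window would be derived from the out-of-order window rather than left at a fixed 10 minutes. A minimal sketch of that rule (the helper is illustrative; only the idea of taking the larger of the two windows comes from the comment above):

```go
package resultscache

import "time"

// effectiveCacheFreshness returns how far back from "now" results should stay
// uncached: at least the default cache-freshness, widened to the tenant's
// out-of-order time window when that is larger.
func effectiveCacheFreshness(maxCacheFreshness, oooWindow time.Duration) time.Duration {
	if oooWindow > maxCacheFreshness {
		return oooWindow
	}
	return maxCacheFreshness
}
```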
> Query caching already doesn't cache anything within the last 10 minutes
And this is inadequate to avoid problems, as I noted in the description. Following your line of reasoning we should default it to 2 hours, which I feel will give a big degradation in query performance for most people.
(To emphasise: this is an issue which is made worse by out-of-order, but not something unique to it).
> It seems better than setting cache-freshness to match the out-of-order time window :)
As a first solution, is it that bad to set cache-freshness to be greater than or equal to the OOO time window? What's the expected OOO time window for most customers? If it's in the order of tens of minutes, then I don't think it would bring a big performance penalty, but it would be easier to build than the solutions mentioned above.
Ignoring out-of-order, we still need to cover at least the last 1h of queried data, since it can still get new time series that are not out-of-bounds.
The OOO time window might be 30m for most customers, but it can go up to multiple hours, for example 4 hours, at large scale with heavy queries.
Just to recap, this is what I think we've discussed so far:
- Increasing cache-freshness up to the OOO window, or to cover the head boundary for in-order samples, would give us the right outputs but would come with a theoretically severe performance impact. (I say theoretically because we haven't tested it yet, but we're confident that this will be the case.)
- Lowering the TTLs of extents cached in the query-frontend would not fix the problem, but in the event of out-of-order data being ingested, the right output for a query would come sooner rather than later.
Although none of the above are really valid fixes for the issue, they can still be options we have to alleviate the symptoms of the existing problem at scale. So perhaps we could make the TTLs configurable and document these two options.
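If we go the "document the workarounds" route, the configurable knobs could look something like the sketch below per tenant (field names and YAML keys here are hypothetical, not existing Mimir configuration):

```go
package resultscache

import "time"

// ResultsCacheLimits sketches hypothetical per-tenant TTL knobs for the two
// options recapped above; none of these fields exist in Mimir today.
type ResultsCacheLimits struct {
	// TTL for extents whose data lies outside any window where late samples
	// can still arrive; these can be cached for a long time.
	ResultsCacheTTL time.Duration `yaml:"results_cache_ttl"`

	// Shorter TTL for extents overlapping the OOO window / head boundary, so
	// they are re-evaluated and eventually pick up late-arriving samples.
	ResultsCacheTTLForRecentData time.Duration `yaml:"results_cache_ttl_for_recent_data"`
}
```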
> A complicated fix would be to send invalidation messages from ingesters to the query-frontend when older data arrives.
I think the next step is to discuss how we can actually invalidate the cache whenever we detect newly ingested data affecting possibly cached results.
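Purely as a strawman for that discussion (none of these types or messages exist in Mimir), an ingester-side hook could emit an invalidation whenever a write lands behind the frontend's cache-freshness cut-off:

```go
package invalidation

import "time"

// CacheInvalidation tells the query-frontend that cached extents overlapping
// [From, To] for this tenant may now be stale.
type CacheInvalidation struct {
	TenantID string
	From, To time.Time
}

// Invalidator is whatever component owns the results cache on the
// query-frontend side.
type Invalidator interface {
	Invalidate(inv CacheInvalidation) error
}

// NotifyIfLateWrite emits an invalidation when a write contains samples older
// than the cache-freshness cut-off, i.e. samples that may overlap already
// cached extents.
func NotifyIfLateWrite(inv Invalidator, tenant string, oldest, newest, now time.Time, cacheFreshness time.Duration) error {
	cutoff := now.Add(-cacheFreshness)
	if oldest.Before(cutoff) {
		return inv.Invalidate(CacheInvalidation{TenantID: tenant, From: oldest, To: newest})
	}
	return nil
}
```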
I don't completely understand the query caching logic; I tried checking split_and_cache.go and I'm still not sure. So I'm simply asking for two confirmations:
- Impact scope of the issue: Will this issue, which impacts (repeated) query results where the involved samples came in with a timestamp older than now - 10min (10 min being the cache-freshness param), only impact range queries, or will it impact instant queries as well? My understanding is that only range queries are split; instant queries, i.e. all rules (let me know if I am missing something), are only parsed, not split, hence not cached, and hence not impacted by the issue.
- Scope of the solution: For range queries, this issue exists irrespective of whether the OOO window feature is enabled. Configuring the OOO window to something around 1h or 2h with the cache-freshness param set to match is ideal: it never gives a wrong result, but it could impact query performance since the cache is effectively disabled. Configuring the OOO window to 1h or 2h while keeping the lower cache TTL of 10 minutes will ensure eventually consistent results for repeated queries where some of the involved samples arrive with a timestamp older than now - 10min. Basically, to gain correctness for this subset of queries, increase cache-freshness if we don't care about performance, or configure the TTL to get eventually consistent results with a slight hit on performance. Enabling the OOO window is a completely separate decision, but one could consider it further: especially since ingestion is not impacted and the query cache is disabled or TTLed as a fix for this bug, one will wonder why not allow OOO as well.
- Currently, instant queries are not cached, but this could change in the future. Splitting is not relevant.
- I think what you wrote is correct. As I wrote above, "this is an issue which is made worse by out-of-order, but not something unique to it".
> especially since ingestion is not impacted and the query cache is disabled or TTLed as a fix for this bug, one will wonder why not allow OOO as well.
Is this a separate question?