Could/should KNN queries use per-segment query caching?
Description
@msokolov and I were talking about how Lucene's KNN queries don't do any per-segment caching, because they do all of their work up front in rewrite, which happens before the per-segment query cache is checked (I think)...
But conceptually, I think the way the optimistic KNN query rewrite works should play well with caching, if we could somehow implement it: it asks each segment for its top N (where N is pro-rated from that segment's size relative to the whole index and the requested top K), merge sorts all of those per-segment results, and then goes back to any segment(s) that might still have further competitive hits and digs deeper.
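Roughly this flow, as a sketch (this is not Lucene's actual implementation; `searchLeaf` here is a made-up stand-in for the per-segment HNSW search, and the "dig deeper" step is elided):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Sketch only, not Lucene's actual code; searchLeaf() is hypothetical.
abstract class OptimisticKnnSketch {
  abstract TopDocs searchLeaf(LeafReaderContext leaf, float[] query, int k) throws IOException;

  TopDocs search(IndexReader reader, float[] queryVector, int k) throws IOException {
    List<LeafReaderContext> leaves = reader.leaves();
    TopDocs[] perLeaf = new TopDocs[leaves.size()];
    for (LeafReaderContext leaf : leaves) {
      // Pro-rate the requested top K by this segment's share of the whole index.
      int proRatedK =
          Math.max(1, (int) Math.ceil((double) k * leaf.reader().maxDoc() / reader.maxDoc()));
      perLeaf[leaf.ord] = searchLeaf(leaf, queryVector, proRatedK);
    }
    // Merge sort the per-segment results down to the global top K.
    TopDocs merged = TopDocs.merge(k, perLeaf);
    if (merged.scoreDocs.length < k) {
      return merged; // fewer than K hits in the whole index
    }
    float globalCutoff = merged.scoreDocs[merged.scoreDocs.length - 1].score;
    for (LeafReaderContext leaf : leaves) {
      ScoreDoc[] hits = perLeaf[leaf.ord].scoreDocs;
      // A segment whose worst collected hit is still globally competitive may
      // hold more hits above the cutoff than its pro-rated share allowed for.
      if (hits.length > 0 && hits[hits.length - 1].score >= globalCutoff) {
        // ... go back to this segment with a larger per-leaf K and re-merge
      }
    }
    return merged;
  }
}
```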
I.e. each segment would cache a mapping from the query (vector + top K) to its hits (a sparse bitset), and then a future identical KNN query could pull from that cache as long as its requested top K/N fits within what's already cached?
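Maybe something like this per-segment cache (entirely hypothetical; I'm caching scored hits rather than just the bitset here so that a smaller requested K can be answered from a deeper cached entry):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.search.ScoreDoc;

// Hypothetical per-segment cache, keyed by the exact query vector. A lookup
// hits only if the cached entry was collected at least as deeply (collectedK)
// as the K now being requested.
final class LeafKnnCache {
  private static final class Key {
    final float[] vector;
    Key(float[] vector) { this.vector = vector; }
    @Override public boolean equals(Object o) {
      return o instanceof Key other && Arrays.equals(vector, other.vector);
    }
    @Override public int hashCode() { return Arrays.hashCode(vector); }
  }

  private record Entry(int collectedK, ScoreDoc[] hits) {}

  private final Map<Key, Entry> cache = new ConcurrentHashMap<>();

  /** Returns the cached top-k hits for this exact vector, or null on a miss. */
  ScoreDoc[] get(float[] queryVector, int k) {
    Entry e = cache.get(new Key(queryVector));
    if (e == null || e.collectedK() < k) {
      return null; // nothing cached, or the cached entry isn't deep enough
    }
    return Arrays.copyOfRange(e.hits(), 0, Math.min(k, e.hits().length));
  }

  void put(float[] queryVector, int collectedK, ScoreDoc[] hits) {
    // Keep whichever entry was collected more deeply.
    cache.merge(new Key(queryVector.clone()), new Entry(collectedK, hits),
        (old, fresh) -> fresh.collectedK() > old.collectedK() ? fresh : old);
  }
}
```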
But would vector queries actually get cache hits? At Amazon Product Search we (strangely) can send the identical vector multiple times, even within the execution of a single end-Customer query ... would other Lucene apps have similar behavior? E.g. if they are running inference on something simple, like just the search terms from their users, the identical vector could indeed recur? Or maybe this cache wouldn't have to require precisely identical vectors ... maybe vectors within some small distance of a cached vector should count as a hit too?
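E.g. a fuzzy lookup like this (hypothetical; the linear scan is only for clarity, and whether near-identical vectors may safely share results is itself an open question):

```java
// Hypothetical fuzzy key match: treat a cached vector as a cache hit when it
// lies within epsilon (Euclidean distance) of the incoming query vector.
final class FuzzyVectorKeys {
  static float[] findNearby(Iterable<float[]> cachedVectors, float[] query, float epsilon) {
    for (float[] cached : cachedVectors) {
      double sumSq = 0;
      for (int i = 0; i < query.length; i++) {
        double d = (double) query[i] - cached[i];
        sumSq += d * d;
      }
      if (Math.sqrt(sumSq) <= epsilon) {
        return cached; // close enough: reuse this entry's cached hits
      }
    }
    return null; // true miss
  }
}
```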
Anyway, no clear way forward here, but I wanted to at least open a discussion in case others have ideas around Lucene's per-segment query caching and KNN queries ...
I wonder if this use case would be better served by something like Elasticsearch's shard request cache. The cache key is the whole request (query, number of hits retrieved, etc.) plus an identifier of the current point-in-time view of the top-level index reader, and it caches the whole result. So it invalidates more frequently than the query cache (on every refresh), but it is also safe, including for things that have inter-segment dependencies, such as BM25 scores, which are computed from global term statistics, or vector search, where whether a document is a top hit depends on whether other segments have better hits.
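A toy version of that idea in Lucene terms might look like this (illustrative names only, from neither codebase; keying on the reader's cache key means every refresh naturally invalidates old entries, and caching the fully merged `TopDocs` keeps inter-segment dependencies safe):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.TopDocs;

// Toy request cache in the spirit of Elasticsearch's shard request cache.
final class TopLevelRequestCache {
  private static final class Key {
    final Object readerKey; // identifies the point-in-time reader view
    final float[] vector;
    final int k;

    Key(DirectoryReader reader, float[] vector, int k) {
      // Changes on every refresh, so a stale entry can never be returned.
      this.readerKey = reader.getReaderCacheHelper().getKey();
      this.vector = vector.clone();
      this.k = k;
    }

    @Override public boolean equals(Object o) {
      return o instanceof Key other
          && readerKey == other.readerKey
          && k == other.k
          && Arrays.equals(vector, other.vector);
    }

    @Override public int hashCode() {
      return 31 * (31 * System.identityHashCode(readerKey) + k) + Arrays.hashCode(vector);
    }
  }

  private final Map<Key, TopDocs> cache = new ConcurrentHashMap<>();

  /** Returns the fully merged result, computing and caching it on a miss. */
  TopDocs getOrCompute(DirectoryReader reader, float[] vector, int k, Supplier<TopDocs> search) {
    return cache.computeIfAbsent(new Key(reader, vector, k), key -> search.get());
  }
}
```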
For an HTTP-based service, you can accomplish this by setting cache headers correctly as well. Then the caching is much more flexible: it can happen on the user's device/client, a load balancer, or anywhere in between: Varnish, CDNs, whatever. It doesn't require writing any caches in Java code either.
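E.g. something like this (sketch only; servlet API shown, the `indexVersion` could come from e.g. `DirectoryReader.getVersion()`, and the header values would need tuning to your refresh cadence):

```java
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

// Sketch: derive a validator from the point-in-time index version so every
// refresh changes the ETag; intermediaries (CDN, Varnish, the client) can
// then cache and revalidate without any Java-side result cache.
final class SearchCacheHeaders {
  /** Returns true if the caller can skip the search and send a 304 instead. */
  static boolean applyCacheHeaders(
      HttpServletRequest req, HttpServletResponse resp, long indexVersion) {
    String etag =
        "\"" + indexVersion + "-" + String.valueOf(req.getQueryString()).hashCode() + "\"";
    resp.setHeader("ETag", etag);
    resp.setHeader("Cache-Control", "public, max-age=60"); // tune to refresh rate
    if (etag.equals(req.getHeader("If-None-Match"))) {
      resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED); // 304: reuse cached copy
      return true;
    }
    return false;
  }
}
```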