influxdb feat: last cache implementation

Closes #25093

Initial cut at the last cache implementation in the write buffer. This adds the LastCacheProvider, which is associated with the write buffer, and holds a map for holding the last cache for any given database/table. Data is loaded into the cache at the same time that batches are written to the segment state, i.e., after the WAL has been flushed.

It only supports one cache per table, and is missing some other key features.

Still To-Do

[ ] More testing
[ ] TableProvider impl is not doing any predicate evaluation
[ ] Support multiple caches per table
[ ] Support cache on column subset in a table

Jun 27 '24 21:06 hiltontj

I also realized while writing out the PR feedback that there's another setting we'll need for the last values cache, which is an age out. Since it'll keep last values caches for each unique key column combination seen, it means that for ephemeral key column values (i.e. ephemeral series), they'll continue taking up space in the cache until they're cleared out.

So that duration for timeout should be an additional parameter on the settings (like count). A sensible default might be 4 hours. So every once in a while, the last value cache should be walked so that any key set that hasn't seen a new value in that time is cleared from the cache completely.

Jun 28 '24 22:06 pauldix

@pauldix thank you for the feedback!

I think there might be a little confusion about the desired behavior of the cache. An example might be helpful here and you can tell me if this structure will handle it, or if you're leaving it for later, or if this is clarifying.

My understanding was definitely off, especially w.r.t. the key columns (literally keys in the cache), but your example clears that up for me. I can re-work the implementation to satisfy that behaviour.

Jun 29 '24 12:06 hiltontj

I also realized while writing out the PR feedback that there's another setting we'll need for the last values cache, which is an age out. [...]

Good call, I can add this to the requirements in the related issues.

Jun 29 '24 12:06 hiltontj

My recent set of commits addresses pretty well all of the feedback so far. I refactored the cache to have a hierarchical structure based on the key columns specified, and those are used to evaluate predicates when producing record batches.

I still have several things to do, however:

Implementation

[x] Handle null values for non-key columns in the cache
[x] Handle * caches, i.e., those that have new fields added when they are written after the cache has been initialized with a certain set of value columns (see https://github.com/influxdata/influxdb/issues/25124)

Testing

[x] Using fields as key columns
[x] Checking the series key is used as default key columns (or the tag set) when key columns are not specified
[x] Test with caches bigger than the default size (1)
[x] Test cache TTLs
[x] Test when using invalid predicates, i.e., for columns that are not key columns or that do not exist, or for column values that do not exist
[x] Test null values in the cache
[x] Test adding new fields to * caches (See https://github.com/influxdata/influxdb/issues/25124)

Documentation

[x] Need to add a lot of Rust doc comments
[x] Update PR description with overview of the refactored design

Jul 03 '24 20:07 hiltontj

Going to open a follow-on PR to address the following:

Make the cache creation more idempotent, i.e., do not fail when attempting to create a cache that already exists with the same parameters
Store a single set of Instants for each cache, vs. an Instant beside each value in each column buffer
Clean up the key column maps when their descendants are empty

Jul 09 '24 15:07 hiltontj

@mgattozzi - I am going to merge this and https://github.com/influxdata/influxdb/pull/25125 to keep things rolling, but if you have any feedback I am happy to open follow-on PRs to address.

Jul 09 '24 19:07 hiltontj

Yeah keep things rolling. I haven't had time to look but I'm not worried about your implementations.

Jul 09 '24 20:07 mgattozzi