
Update buffer to use Parquet in memory

pauldix opened this issue on Jan 12, 2024

The buffer should use as little memory as possible for the data while at the same time being relatively fast to query. Having data in Parquet serves both purposes. However, Parquet is immutable, so the buffer will need logic to periodically transform data into Parquet while keeping an easily appendable format for incoming writes.
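As a rough sketch of the shape this could take (all names here are hypothetical, not taken from the influxdb codebase), each table's buffer might pair a mutable vec of rows with a list of already-encoded, immutable Parquet chunks:

```rust
use bytes::Bytes;

/// Hypothetical per-table buffer: a cheap-to-append row vec in front,
/// immutable Parquet-encoded chunks behind.
struct TableBuffer {
    /// Incoming writes land here; appending is just a Vec push.
    open_rows: Vec<Row>,
    /// Chunks already converted to Parquet bytes, immutable from
    /// this point on.
    parquet_chunks: Vec<Bytes>,
}

/// A single buffered write; the real row type would mirror the
/// table's schema.
struct Row {
    time: i64,
    value: f64,
}

impl TableBuffer {
    /// Appends always go to the mutable tail.
    fn push(&mut self, row: Row) {
        self.open_rows.push(row);
    }
}
```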

My basic idea is that for each table:

  • We keep a vec of individual writes that can be cheaply appended to as writes come in
  • Once we've gathered up some number of writes (maybe 8k, to match the DataFusion/Arrow batch size preference?), we send a message to a queue of workers to take that data and convert it into Parquet (sketched below)
  • We cache the Parquet batches, but once we've accumulated some number of them, we compact them into a single batch
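A minimal sketch of the convert step, using the arrow and parquet crates; `convert_to_parquet`, the `Row` type, and the two-column schema are illustrative assumptions, not the actual implementation. The worker drains the row vec once it crosses the batch threshold and encodes a Parquet chunk entirely in memory:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;

/// Same illustrative row type as the sketch above.
struct Row {
    time: i64,
    value: f64,
}

/// Threshold from the issue: roughly the DataFusion/Arrow batch size.
const BATCH_SIZE: usize = 8 * 1024;

/// Hypothetical convert step: drain buffered rows into a RecordBatch
/// and serialize it to Parquet bytes, all in memory. Returns None
/// until enough rows have gathered.
fn convert_to_parquet(
    rows: &mut Vec<Row>,
) -> Result<Option<Bytes>, Box<dyn std::error::Error>> {
    if rows.len() < BATCH_SIZE {
        return Ok(None);
    }
    let drained: Vec<Row> = rows.drain(..).collect();

    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("value", DataType::Float64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(TimestampNanosecondArray::from_iter_values(
                drained.iter().map(|r| r.time),
            )) as ArrayRef,
            Arc::new(Float64Array::from_iter_values(drained.iter().map(|r| r.value))),
        ],
    )?;

    // ArrowWriter targets any `Write` impl, so a plain Vec<u8> keeps
    // the encoded Parquet chunk in memory rather than on disk.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(Some(Bytes::from(buf)))
}
```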

We'll likely need some logic to split the Parquet data once it grows past some threshold (50M rows, say). Ideally we'd organize that split by time and keep the resulting Parquet files non-overlapping. So in some sense we'd be doing little in-memory compactions within the segment.
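A minimal sketch of the non-overlapping split, assuming a fixed time window (the window size, like the 50M-row trigger, would be a knob to tune; `split_by_time` is a hypothetical name):

```rust
use std::collections::BTreeMap;

/// Same illustrative row type as the sketches above.
struct Row {
    time: i64,
    value: f64,
}

/// Hypothetical split step: bucket rows by a fixed time window so
/// each resulting Parquet file covers a disjoint time range.
fn split_by_time(rows: Vec<Row>, window_nanos: i64) -> Vec<Vec<Row>> {
    let mut buckets: BTreeMap<i64, Vec<Row>> = BTreeMap::new();
    for row in rows {
        // Euclidean division floors toward the window start, so every
        // row in a bucket falls inside [start, start + window).
        let bucket = row.time.div_euclid(window_nanos);
        buckets.entry(bucket).or_default().push(row);
    }
    // BTreeMap iterates in key order, so the output groups come back
    // sorted by time and pairwise non-overlapping.
    buckets.into_values().collect()
}
```

Each group would then go through the same convert step as above, yielding Parquet chunks whose time ranges don't overlap.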
