delta-rs
Object Store Caching layer
Description
Use Case
We currently read them both in full twice in the worst case. We can, however, cache the first round of reading, which would reduce the second round of fetching files from the object store.
Related Issue(s)
See discussion here: https://github.com/delta-io/delta-rs/issues/2760
General I/O caching support is the engine's responsibility, not the kernel's. The kernel just needs to not get in the way of an engine implementing such caching.
To that end, it would be more helpful to identify specific ways kernel blocks or impedes the engine from doing the kinds of caching it would like to do.
For example, the JsonHandler and ParquetHandler traits provided by the engine should make ideal hook points for caching the results of file reads. The engine currently performs file writes on its own, without kernel involvement, so it would be up to the engine to introduce appropriate caching there as it sees fit.
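To make the idea of a hook point concrete, here is a minimal sketch of a caching decorator around a file-read handler. The `FileReadHandler` trait below is a simplified, hypothetical stand-in - the real JsonHandler and ParquetHandler signatures in delta-kernel-rs differ - but the wrapping pattern is the same:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-in for a kernel file-read handler trait.
// The point is only that a trait boundary like this is a natural place
// to interpose a cache without the kernel being involved at all.
trait FileReadHandler {
    fn read_file(&self, path: &str) -> std::io::Result<Arc<Vec<u8>>>;
}

// Decorator that caches the results of the wrapped handler, keyed by path.
struct CachingHandler<H: FileReadHandler> {
    inner: H,
    cache: Mutex<HashMap<String, Arc<Vec<u8>>>>,
}

impl<H: FileReadHandler> CachingHandler<H> {
    fn new(inner: H) -> Self {
        Self { inner, cache: Mutex::new(HashMap::new()) }
    }
}

impl<H: FileReadHandler> FileReadHandler for CachingHandler<H> {
    fn read_file(&self, path: &str) -> std::io::Result<Arc<Vec<u8>>> {
        // Second and later reads of the same file are served from memory.
        if let Some(hit) = self.cache.lock().unwrap().get(path) {
            return Ok(Arc::clone(hit));
        }
        let bytes = self.inner.read_file(path)?;
        self.cache.lock().unwrap().insert(path.to_string(), Arc::clone(&bytes));
        Ok(bytes)
    }
}
```

The same wrapping approach works whether the cache lives inside the engine's handler implementations or in a layer the engine places in front of them.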
On the other hand, I know @roeap has been considering higher level caching approaches in delta-rs that would capture the result of log replay rather than going back to the underlying individual files at all. As we identify such optimization opportunities, kernel APIs may need to adapt if the engine currently lacks appropriate hook points.
Heh. I missed that this was a delta-rs issue, not delta-kernel-rs!
That said, if this work does identify gaps in kernel APIs, please do file enhancement issues against delta-kernel-rs.
@scovich, while certainly an engine issue, I have been considering contributing some of this to the default engine implementation in kernel... but I need to look deeper into how much this will blow up.
That said, I do believe the kernel APIs lend themselves well to caching, as we can explicitly also cache parsing (and leverage Arrow's on-disk representation, which was built for this purpose - I think).
@roeap Have you had any time to think about a direction for caching? I would love to help push this forward if there's a design direction.
@tonyalaribe - indeed we did. The current best guess is that we'll be quite "surgical" when it comes to caching. Generally speaking we want to be as lazy as possible and not do any work just because we might need it. This particularly applies to reading non-file actions from the log (e.g. Txn, CommitInfo, ...). Aside from IO, we also have the opportunity to save on decoding costs when reading parquet. With that, there are three main strategies we apply:
1. To offset not storing non-file actions, we cache the raw bytes read from storage for JSON commit files and deletion vectors. We cannot store the parsed JSON, as we discard that data during parsing already. IO-level caching should offset the bulk of the cost though (a rough sketch follows below this list).
2. Parsed file actions from a scan. This is what we currently keep in memory, but we can save big on scans if we can provide existing data when we want to update versions etc. There is a draft PR open on kernel.
3. Parquet footer caching. Our parquet reads are quite selective, so we are not wasting anything during log replay. However, the non-file actions especially are quite sparse, so if we cache the parquet footers, we should be able to do some effective skipping when reading these actions from checkpoints (see the second sketch below).
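As a rough illustration of point 1, a minimal byte-level cache keyed by path; the type name and the fetch callback are hypothetical, and a real implementation would sit in front of the object store client:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Illustrative byte-level cache for immutable log files (JSON commits,
/// deletion vectors). The name and `fetch` callback are hypothetical.
struct RawByteCache {
    entries: Mutex<HashMap<String, Arc<[u8]>>>,
}

impl RawByteCache {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Return cached bytes for `path`, falling back to `fetch` on a miss.
    /// Delta commit files are written once and never modified, so a plain
    /// path-keyed cache needs no invalidation logic.
    fn get_or_fetch<F>(&self, path: &str, fetch: F) -> std::io::Result<Arc<[u8]>>
    where
        F: FnOnce(&str) -> std::io::Result<Vec<u8>>,
    {
        if let Some(hit) = self.entries.lock().unwrap().get(path) {
            return Ok(Arc::clone(hit));
        }
        let bytes: Arc<[u8]> = fetch(path)?.into();
        self.entries.lock().unwrap().insert(path.to_string(), Arc::clone(&bytes));
        Ok(bytes)
    }
}
```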
1 and 2 apply exclusively to the metadata phase - i.e. reading data from the delta log - 3, however, also applies when we are reading the actual data files. I am hoping that by integrating with DataFusion, which has clear guidance on how such caches can be integrated, we can keep it fairly simple on our end.
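And a rough illustration of point 3: keep parsed footers around so repeat reads of the same checkpoint (or data file) don't have to re-fetch and re-parse the footer. The cache is generic over the metadata type so the sketch doesn't pin a parquet dependency; in practice `M` would be something like parquet's ParquetMetaData, and the `load` closure is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Illustrative footer cache: parse a file's parquet footer once, then
/// reuse the parsed metadata (row-group boundaries, column statistics)
/// for subsequent selective reads of the same file.
struct FooterCache<M> {
    footers: Mutex<HashMap<String, Arc<M>>>,
}

impl<M> FooterCache<M> {
    fn new() -> Self {
        Self { footers: Mutex::new(HashMap::new()) }
    }

    /// Look up the footer for `path`, parsing it with `load` only on a miss.
    /// With the footer in hand, a reader can decide which row groups can
    /// contain the sparse non-file actions and only issue range reads for
    /// those, instead of touching the whole checkpoint again.
    fn get_or_load<F, E>(&self, path: &str, load: F) -> Result<Arc<M>, E>
    where
        F: FnOnce(&str) -> Result<M, E>,
    {
        if let Some(hit) = self.footers.lock().unwrap().get(path) {
            return Ok(Arc::clone(hit));
        }
        let meta = Arc::new(load(path)?);
        self.footers.lock().unwrap().insert(path.to_string(), Arc::clone(&meta));
        Ok(meta)
    }
}
```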
As always, there is more to learn along the way... I'll be putting up a PR for the byte-level cache, and we'll get 2 once the kernel work lands.
Any help either via advice/opinions or PRs is of course very welcome :).