feat: implement object store that caches to disk
Description
This adds a custom object store that caches files to disk. I've seen some discussions around this but no implementation of it yet, so I'm opening this quick PR to gather opinions. I'm very open to changing the API or implementation.
This solution is very simple (possibly overly naive) and specific to our use case of delta-rs, so there might be some edge cases from newer features that I did not consider. To enable it, you specify a `file_cache_path`; all files requested by get requests are then downloaded and saved to this path. For range requests, this still downloads the entire file, which means first-run performance will be worse in many cases. I think there's a lot of potential for optimization here.
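To make the trade-off concrete, here is a minimal stdlib-only sketch of the naive strategy just described (hypothetical code, not the PR's actual implementation; `download` stands in for the underlying object store's get call):

```python
from pathlib import Path


def cached_get_range(url: str, start: int, end: int, cache_dir: str, download) -> bytes:
    """Naive disk cache: on a miss, download the ENTIRE object (even for a
    range request), persist it, then serve byte ranges from the local copy.
    `download` is a stand-in for the remote object store's get call."""
    local = Path(cache_dir) / url.replace("://", "_").replace("/", "_")
    if not local.exists():
        local.write_bytes(download(url))  # full download on first access
    return local.read_bytes()[start:end]
```

The first range request pays for the whole object, but every later request for any range of the same file is served from disk.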
There is also an option to specify a cache duration for `_last_checkpoint`, which to my knowledge is the only mutable file. We found that reusing the local (and potentially outdated) checkpoint parquet improves performance, because loading the incremental JSON files is much faster than downloading a new checkpoint parquet.
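The staleness check this implies could look something like the following (an illustrative sketch using the file's mtime; the function and parameter names are mine, not the PR's):

```python
import time
from pathlib import Path


def is_cached_last_checkpoint_valid(cache_path: Path, valid_duration_secs: float) -> bool:
    """Return True if the cached _last_checkpoint is fresh enough to reuse.

    Illustrative only: treats the cached file's modification time as the time
    it was downloaded, and considers it valid for `valid_duration_secs`.
    """
    if not cache_path.exists():
        return False
    age = time.time() - cache_path.stat().st_mtime
    return age <= valid_duration_secs
```

A stale or missing entry falls through to a normal remote read, so the worst case is the uncached behavior.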
We've been using this at our company for the last couple weeks. It has been very helpful for loading large datasets, so we think a caching feature like this would be quite valuable.
Usage example:

```python
dt = DeltaTable(
    "s3://bucket/table",
    storage_options={
        "file_cache_path": "/Users/peter/delta-cache",
        "file_cache_last_checkpoint_valid_duration": "30m",
    },
)

# do stuff with dt
```
@roeap was actually going to add a caching object store based on foyer this weekend, which supports hybrid caching :)
Just opened a draft PR - still need to wire things through, but the cache implementation should be close to finished.
@PeterKeDer - would love your feedback and happy to discuss which route we should take.
The most obvious difference is that I used an external implementation of a cache. The policy is also more selective: right now we only cache specific log files (the JSON ones) and omit the others. We do, however, try to make the cache shareable, so that it can be used for multiple table instances - i.e. we use a fully qualified URL as the cache key.
To enable additional caching there have been various updates to delta kernel, which allow us to incrementally update a snapshot and scan state.
These would however only get activated once we have kernel for log handling ...
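The URL-keyed shared cache described above can be sketched roughly like this (hashing the URL for a filesystem-safe name is an illustrative choice of mine, not necessarily what the draft PR does):

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir: str, file_url: str) -> Path:
    """Derive an on-disk cache location from the fully qualified file URL, so
    two table instances reading the same URL share one cache entry. Hashing
    keeps the cache file name filesystem-safe regardless of the URL's scheme
    or path characters."""
    key = hashlib.sha256(file_url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / key
```

Because the key is the full URL rather than a table-relative path, two tables at different locations can never collide, while repeated opens of the same table hit the same entries.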
Codecov Report
:x: Patch coverage is 5.48780% with 155 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 14.57%. Comparing base (6a4f249) to head (3643c91).
:warning: Report is 28 commits behind head on main.
:exclamation: There is a different number of reports uploaded between BASE (6a4f249) and HEAD (3643c91). Click for more details.
HEAD has 5 fewer uploads than BASE

| Flag | BASE (6a4f249) | HEAD (3643c91) |
|------|----------------|----------------|
|      | 6              | 1              |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #3439       +/-  ##
===========================================
- Coverage   74.50%   14.57%   -59.94%
===========================================
  Files         146       71       -75
  Lines       44876    12273    -32603
  Branches    44876    12273    -32603
===========================================
- Hits        33437     1789    -31648
- Misses       9248    10270     +1022
+ Partials     2191      214     -1977
```
@roeap looks cool! Glad you're already working on it 😄
I'm curious about the decision to only cache json log files. What's fundamentally blocking your implementation from also caching checkpoints and data files? I'm not too familiar with the delta kernel so there may be a knowledge gap there.
From my experience, downloading the data files generally consumes a lot more time than updating the transaction log. Having a cache work for the data files too would be very valuable for our use cases.
@PeterKeDer - there may be a few things to unpack before that choice makes sense 😆.

- We have two phases to look at: a) metadata, b) data. In the metadata phase we can be very opinionated about caching, as we will always be reading files from the log. In the data phase, it's just about reading the data files.
- delta-rs more or less is a cache, since we keep all file actions for a snapshot in memory. In kernel we already added methods to take advantage of that and replay from a given state.
- In a kernel log replay, we discard some information from the JSON files that we may later need.

With that, the idea is to use the filesystem / byte-level caching only for the JSON files, so we can re-read them to extract other actions.
Kernel also introduces file-format-level abstractions - i.e. `read_parquet` - which allow us to also avoid additional decoding of parquet data by e.g. caching the data in feather format (basically flat arrow).
All of this mainly applies to the metadata phase; in the data phase we can then use general-purpose caching - here I hope that we can leverage mechanics built into datafusion ...
All that said, if we make the caching here configurable - just enable / disable - we can see to getting this PR merged, as all the above is obviously a multi-step, longer-duration effort 😆. I would then maybe de-construct this over time?
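The "cache in an already-decoded format" idea can be illustrated generically with the standard library (JSON standing in for parquet and pickle standing in for feather/Arrow IPC - this is a conceptual stand-in, not the kernel's actual `read_parquet` API):

```python
import json
import pickle
from pathlib import Path


def read_decoded(src: Path, cache: Path):
    """Decode `src` once and persist the decoded object in a cheap-to-load
    binary form; later reads skip the expensive decode entirely.

    Stand-ins: JSON plays the role of parquet, pickle the role of feather.
    """
    if cache.exists():
        return pickle.loads(cache.read_bytes())
    obj = json.loads(src.read_text())
    cache.write_bytes(pickle.dumps(obj))
    return obj
```

The point is that the cache stores the decoded representation, so repeated snapshot loads pay the parse cost only once.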
Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching on bytes level instead of file level
Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching build in I wanted to look at.
This one from seafowl: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs
Makes sense, thanks for the explanation!
Yep, the caching here is entirely configurable: it is disabled by default, but you can just pass in a storage option to enable it. If you think this PR is fine to merge, I'm happy to clean it up a bit and add a few test cases. We should also make clear the potential downsides, i.e. slower initial reads if we rely on range requests, and a growing cache size as we load more files.
@roeap @ion-elgreco I cleaned this up & added simple tests, appreciate if you can take a look and let me know your thoughts
@roeap I took a shot at moving the config over to DeltaConfig, let me know if I'm using it correctly. Also, I noticed `decorate_store` is not being applied to the root store, but it probably should be if we add functionality other than prefixing. I added a separate function to handle that, let me know if that makes sense.
Separately, do you happen to know why this test is failing? e.g. here. It passes for me locally when running `cargo test --verbose --features azure,datafusion,s3,gcs,glue,hdfs`.
@PeterKeDer - looking good, the config system is used as intended. In case you have feedback w.r.t. ease of use, happy to hear it, since this is very new.
You are absolutely right about the decoration being applied to the root store. We will hopefully eventually be able to get rid of the prefixed store altogether. That said, I'll likely have to tinker a bit with how we apply things, since we may need to access async code when setting things up (e.g. to get a dedicated IO runtime without risking being outside of an async context) while also avoiding making all the setup code async.
I'll have a look at the test failure later today and see if I can support here.
@roeap were you able to take a look at the test failures? I wonder if there are differences writing files when the test is running in the pipeline vs locally that causes the cache to not work as intended
@PeterKeDer - after a bit of investigating, I believe the reason is that the temporary directory gets dropped (and with that deleted) prior to the last access, so we effectively purged our cache. Why this shows up only in CI and not locally 🤷♂️.
We have seen some flakiness with `tempdir()` in the past as well. While we could try to manage the lifetime of the dir differently, I'd suggest a slight refactor to a) work around this and b) be a bit more flexible, I think. I tried this out here - essentially making the store generic over the file cache store.
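The failure mode is not specific to Rust's `tempdir()`; Python's analogue behaves the same way, which makes it easy to demonstrate: the directory lives only as long as the handle object does.

```python
import os
import tempfile

# Create a temporary directory and put a "cache entry" in it.
td = tempfile.TemporaryDirectory()
cache_dir = td.name
with open(os.path.join(cache_dir, "entry"), "w") as f:
    f.write("cached bytes")

# Dropping the handle deletes the directory - and our cache with it,
# exactly like the tempdir drop suspected in the CI failure.
del td
assert not os.path.exists(cache_dir)
```

Whether the cleanup races the last cache access then depends on object lifetimes, which plausibly differ between CI and a local run.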
Ahh interesting, thanks for looking into it! I did use tempdir in some other tests though, I wonder why they didn't fail 🤔
The change to make the store more generic makes sense, people can customize the cache if needed; let me add that change and update the tests.
@roeap updated per your changes, let me know what you think
Regarding the earlier question about byte-level parquet caching: in DF50 there's now a parquet file cache that can be instantiated. Since the parquet handling is done with the delta kernel and not Datafusion, there's probably no way to directly benefit from it, but this epic probably has nice snippets for what it could look like: https://github.com/apache/datafusion/issues/17000