
feat: implement object store that caches to disk

Open PeterKeDer opened this issue 11 months ago • 17 comments

Description

This adds a custom object store that caches files to disk. I've seen some discussions around this but no implementation yet, so I'm opening this quick PR to gather opinions. Very open to changing the API or implementation.

This solution is very simple (possibly overly naive) and specific to our use case of delta-rs, so there might be some edge cases from newer features that I did not consider. To enable it, you specify a file_cache_path; all files requested by get requests are then downloaded and saved to this path. For range requests, this still downloads the entire file, which means performance will be worse in many cases on the first run; I think there's a lot of potential for optimization here.

There is also an option to specify a cache duration for _last_checkpoint, which to my knowledge is the only file that is mutable. We found that this improves performance by using the local (and potentially outdated) checkpoint parquet, because loading the incremental JSON files is much faster than loading a new checkpoint parquet.

We've been using this at our company for the last couple of weeks. It has been very helpful for loading large datasets, so we think a caching feature like this would be quite valuable.

Usage example:

dt = DeltaTable(
  "s3://bucket/table",
  storage_options={
    "file_cache_path": "/Users/peter/delta-cache",
    "file_cache_last_checkpoint_valid_duration": "30m",
})
# do stuff with dt

PeterKeDer avatar May 17 '25 07:05 PeterKeDer
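A rough Python sketch of the caching behavior described above (the PR itself is Rust; `cached_get`, the `fetch` callback, and the TTL default are illustrative names, not the actual API): immutable objects are cached indefinitely, while the one mutable file, `_last_checkpoint`, is revalidated after a configurable duration.

```python
import hashlib
import time
from pathlib import Path


def cached_get(path: str, cache_dir: str, fetch,
               checkpoint_ttl_secs: float = 30 * 60) -> bytes:
    """Return the bytes of `path`, serving from a disk cache when possible.

    `fetch` stands in for a full object-store GET. Delta data and log files
    are immutable, so they are cached indefinitely; the mutable
    `_last_checkpoint` entry is refetched once its cached copy is older
    than `checkpoint_ttl_secs`.
    """
    key = hashlib.sha256(path.encode()).hexdigest()
    cached = Path(cache_dir) / key
    if cached.exists():
        age = time.time() - cached.stat().st_mtime
        if not path.endswith("_last_checkpoint") or age < checkpoint_ttl_secs:
            return cached.read_bytes()
    # Cache miss (or expired checkpoint): download and persist to disk.
    data = fetch(path)
    cached.write_bytes(data)
    return data
```

Note that, as the PR description says, a range request under this scheme still has to materialize the whole object on first access: the cache key is the object path, not a byte range.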

@roeap was actually going to add a caching object store based on foyer this weekend, which supports hybrid caching :)

ion-elgreco avatar May 17 '25 08:05 ion-elgreco

Just opened a draft PR - still need to wire things through, but the cache implementation should be close to finished.

@PeterKeDer - would love your feedback and happy to discuss which route we should take.

The most obvious difference is that I used an external cache implementation. The policy is also more selective, in that right now we only cache specific log files (the JSON ones) and omit the others. We do, however, try to make the cache shareable, such that it can be used for multiple table instances - i.e. we use the fully qualified URL as the cache key.

To enable additional caching, there have been various updates to delta-kernel that allow us to incrementally update a snapshot and scan state.

These would, however, only get activated once we have kernel-based log handling ...

roeap avatar May 17 '25 09:05 roeap
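The selective policy described above (cache only the JSON commit files, key by fully qualified URL so the cache can be shared across table instances) amounts to something like this sketch; the function names are hypothetical, not the draft PR's actual API:

```python
from urllib.parse import urlparse


def should_cache(url: str) -> bool:
    """Cache only JSON commit files from the Delta log; skip
    checkpoints and data files (the json-only policy above)."""
    path = urlparse(url).path
    return "/_delta_log/" in path and path.endswith(".json")


def cache_key(url: str) -> str:
    """Key by the fully qualified URL, so one cache instance can be
    shared across multiple tables without collisions."""
    return url
```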

Codecov Report

:x: Patch coverage is 5.48780% with 155 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 14.57%. Comparing base (6a4f249) to head (3643c91). :warning: Report is 28 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/logstore/storage/file_cache.rs 0.00% 145 Missing :warning:
crates/core/src/logstore/config.rs 52.94% 6 Missing and 2 partials :warning:
crates/core/src/logstore/mod.rs 0.00% 0 Missing and 2 partials :warning:

:exclamation: There is a different number of reports uploaded between BASE (6a4f249) and HEAD (3643c91). Click for more details.

HEAD has 5 fewer uploads than BASE
Flag BASE (6a4f249) HEAD (3643c91)
6 1
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3439       +/-   ##
===========================================
- Coverage   74.50%   14.57%   -59.94%     
===========================================
  Files         146       71       -75     
  Lines       44876    12273    -32603     
  Branches    44876    12273    -32603     
===========================================
- Hits        33437     1789    -31648     
- Misses       9248    10270     +1022     
+ Partials     2191      214     -1977     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.


codecov[bot] avatar May 17 '25 13:05 codecov[bot]

@roeap looks cool! Glad you're already working on it 😄

I'm curious about the decision to only cache json log files. What's fundamentally blocking your implementation from also caching checkpoints and data files? I'm not too familiar with the delta kernel so there may be a knowledge gap there.

From my experience, downloading the data files generally consumes a lot more time than updating the transaction log. Having a cache work for the data files too would be very valuable for our use cases.

PeterKeDer avatar May 17 '25 20:05 PeterKeDer

@PeterKeDer - there may be a few things to unpack before that choice makes sense 😆.

  1. We have 2 phases to look at: a) metadata, b) data ... in the metadata phase we can be very opinionated about caching, as we will always be reading files from the log. In the data phase, it's "just" about reading the data files.

  2. delta-rs more or less is a cache already, since we keep all file actions for a snapshot in memory. In kernel we have already added methods to take advantage of that and replay from a given state.

  3. In a kernel log replay, we discard some information from the JSON files that we may need later.

With that, the idea is to use the filesystem / byte-level caching only for the JSON files, so we can re-read them to extract other actions.

Kernel also introduces file-format-level abstractions - i.e. read_parquet - which allow us to also avoid additional decoding of parquet data, e.g. by caching the data in feather format (basically flat arrow).

All of this mainly applies to the metadata phase; in the data phase we can then use general-purpose caching - here I hope that we can leverage mechanisms built into datafusion ...

All that said, if we make the caching here configurable - just enable / disable - we can see to getting this PR merged, as all of the above is obviously a multi-step, longer-duration effort 😆. I would then maybe deconstruct this over time?

roeap avatar May 17 '25 20:05 roeap

Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching on bytes level instead of file level

ion-elgreco avatar May 17 '25 20:05 ion-elgreco

> Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching on bytes level instead of file level

Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching built in that I wanted to look at.

roeap avatar May 17 '25 20:05 roeap

> > Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching on bytes level instead of file level

> Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching built in that I wanted to look at.

This one from seafowl: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs

ion-elgreco avatar May 17 '25 21:05 ion-elgreco
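For context, byte-level (rather than file-level) caching as mentioned here typically means caching fixed-size chunks of an object, so a range read only fetches the chunks it touches instead of the whole file. A minimal sketch, with a plain dict standing in for a real eviction-aware cache and all names hypothetical (this is not seafowl's actual implementation):

```python
CHUNK = 1 << 20  # 1 MiB chunk size (illustrative)


def read_range(path, start, end, cache, fetch_range):
    """Serve bytes [start, end) of `path`, caching chunk-aligned ranges.

    `fetch_range(path, a, b)` stands in for an object-store ranged GET;
    `cache` maps (path, chunk_index) -> bytes. Only the chunks that
    overlap the request are fetched, and repeat reads hit the cache.
    """
    out = bytearray()
    first, last = start // CHUNK, (end - 1) // CHUNK
    for i in range(first, last + 1):
        key = (path, i)
        if key not in cache:
            cache[key] = fetch_range(path, i * CHUNK, (i + 1) * CHUNK)
        out += cache[key]
    # Trim the assembled chunks down to the exact requested range.
    lo = start - first * CHUNK
    return bytes(out[lo:lo + (end - start)])
```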

Makes sense, thanks for the explanation!

Yep, the caching here is entirely configurable; it is disabled by default, but you can just pass in a storage option to enable it. If you think this PR is fine to merge, I'm happy to clean it up a bit and add a few test cases. We should also make the potential downsides clear, i.e. slower initial reads if we rely on range requests, and a growing cache size as we load more files.

PeterKeDer avatar May 18 '25 04:05 PeterKeDer

@roeap @ion-elgreco I cleaned this up & added simple tests; I'd appreciate it if you could take a look and let me know your thoughts.

PeterKeDer avatar May 19 '25 20:05 PeterKeDer

@roeap I took a shot at moving the config over to DeltaConfig, let me know if I'm using it correctly. Also, I noticed decorate_store is not being applied to the root store, but it probably should be if we add functionality other than prefix. I added a separate function to handle that, let me know if that makes sense.

Separately, do you happen to know why this test is failing? e.g. here. It passes for me locally when running cargo test --verbose --features azure,datafusion,s3,gcs,glue,hdfs

PeterKeDer avatar May 21 '25 01:05 PeterKeDer

@PeterKeDer - looking good, the config system is used as intended. In case you have feedback w.r.t. ease of use, happy to hear it, since this is very new.

You are absolutely right about the decoration needing to be applied to the root store. We will hopefully be able to get rid of the prefixed store altogether eventually. That said, I'll likely have to tinker a bit with how we apply things, since we may need to access async code when setting things up (e.g. to get a dedicated IO runtime without risking being outside of an async context) while also avoiding making all the setup code async.

I'll have a look at the test failure later today and see if I can support here.

roeap avatar May 21 '25 12:05 roeap

@roeap were you able to take a look at the test failures? I wonder if there are differences writing files when the test is running in the pipeline vs locally that causes the cache to not work as intended

PeterKeDer avatar May 28 '25 19:05 PeterKeDer

@PeterKeDer - after a bit of investigating, I believe the reason is that the temporary directory gets dropped (and with that deleted) prior to the last access, so we effectively purged our cache. Why this shows up only in CI and not locally 🤷‍♂️.

We have seen some flakiness with tempdir() in the past as well. While we could see that we somehow manage the lifetime of the dir differently, I'd suggest a slight refactor to a) work around this and b) be a bit more flexible, I think. I tried this out here - essentially making the store generic over the file cache store.

roeap avatar May 29 '25 11:05 roeap
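The failure mode suspected here is easy to reproduce in any language where a temp-dir handle owns the directory's lifetime. In Python, for example, `tempfile.TemporaryDirectory` deletes its directory once the handle is garbage-collected (immediately under CPython's reference counting), so returning only the path hands the caller a directory that no longer exists:

```python
import os
import tempfile


def broken_cache_dir() -> str:
    # BUG: the TemporaryDirectory handle is dropped when this function
    # returns, and its finalizer deletes the directory on disk. Any
    # cache writes to the returned path then fail or vanish, mirroring
    # the "tempdir dropped before last access" CI failure.
    tmp = tempfile.TemporaryDirectory()
    return tmp.name


def working_cache_dir():
    # Returning the handle alongside the path ties the directory's
    # lifetime to the caller's scope instead.
    tmp = tempfile.TemporaryDirectory()
    return tmp, tmp.name
```

The Rust `tempfile::tempdir()` used in the tests behaves the same way: the directory is removed when the `TempDir` value is dropped, so the fix is to keep the guard alive for the duration of the test.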

Ahh interesting, thanks for looking into it! I did use tempdir in some other tests though, I wonder why they didn't fail 🤔

The change to make the store more generic makes sense; people can customize the cache if needed. Let me add that change and update the tests.

PeterKeDer avatar May 29 '25 19:05 PeterKeDer

@roeap updated per your changes, let me know what you think

PeterKeDer avatar May 30 '25 19:05 PeterKeDer

> > Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching on bytes level instead of file level

> Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching built in that I wanted to look at.

Regarding this, in DF50 there's now a parquet file cache that can be instantiated. Since the parquet handling is done with the delta kernel and not Datafusion, there's probably no way to benefit from it directly, but this epic probably has nice snippets for what it could look like: https://github.com/apache/datafusion/issues/17000