Iceberg FileIO support read caches
Is your feature request related to a problem or challenge?
Currently, FileIO will use create_operator to construct an operator reader to fetch data from storage.
But when the same files are read from storage repeatedly, we want to cache the deserialized arrow arrays or the raw IO bytes in a hybrid cache system (memory -> local disk -> s3).
It would be better if FileIO could integrate with a custom cache injected by the user.
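To make the hybrid lookup concrete, here is a minimal sketch of the memory -> local disk -> s3 read path. Every name in it (`Tier`, `MapTier`, `TieredReader`) is hypothetical and is not part of iceberg-rust or opendal; an in-memory map stands in for each tier so the sketch has no dependencies.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical tier abstraction (an assumption, not an iceberg-rust API).
pub trait Tier {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
    fn put(&self, path: &str, bytes: Vec<u8>);
}

/// In-memory stand-in used for every tier so the sketch runs as-is.
/// A real setup would back the tiers with memory, local disk, and s3.
#[derive(Default)]
pub struct MapTier {
    entries: Mutex<HashMap<String, Vec<u8>>>,
}

impl Tier for MapTier {
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.entries.lock().unwrap().get(path).cloned()
    }

    fn put(&self, path: &str, bytes: Vec<u8>) {
        self.entries.lock().unwrap().insert(path.to_string(), bytes);
    }
}

/// Tiers are ordered fastest first; the last one is the source of truth.
pub struct TieredReader {
    tiers: Vec<Box<dyn Tier>>,
}

impl TieredReader {
    pub fn new(tiers: Vec<Box<dyn Tier>>) -> Self {
        Self { tiers }
    }

    /// Returns the bytes and the index of the tier that served them.
    pub fn read(&self, path: &str) -> Option<(Vec<u8>, usize)> {
        for (idx, tier) in self.tiers.iter().enumerate() {
            if let Some(bytes) = tier.get(path) {
                // Promote the entry into every faster tier for later reads.
                for faster in &self.tiers[..idx] {
                    faster.put(path, bytes.clone());
                }
                return Some((bytes, idx));
            }
        }
        None
    }
}
```

The first read of a path is served by the slowest tier and promoted, so a second read of the same path is served from the fastest tier. Eviction and size limits are deliberately left out.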
Describe the solution you'd like
No response
Willingness to contribute
None
Thank you so much for getting this started! As I mentioned in https://github.com/apache/iceberg-rust/issues/1036, I believe iceberg-rust should offer a mechanism that allows users to integrate with any cache of their choice, rather than hardcoding one that they can't customize or replace.
Thanks @sundy-li for raising this. I have concerns about adding caching functionality to FileIO. Caching is a complex problem that may involve memory management, caching policy, and even IO scheduling and disk management if we want on-disk caching. These concerns are better coupled with the compute runtime and should live outside the iceberg crate. WDYT?
I think caching should not be implemented inside the iceberg crate, but we should add a layer function to the FileIO struct so that a dynamic cache layer can be injected.
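A rough sketch of what such a layer hook could look like, assuming a builder-style `layer` method. `Reader`, `ReadLayer`, `FileIoSketch`, `MemoryCacheLayer` and `CountingStorage` are all hypothetical names for illustration; none of them is the actual FileIO API, and the real FileIO is async while this sketch is synchronous for brevity.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical read interface standing in for the storage operator.
pub trait Reader: Send + Sync {
    fn read(&self, path: &str) -> Vec<u8>;
}

/// Hypothetical layer: wraps a reader and returns a decorated reader.
pub trait ReadLayer {
    fn wrap(&self, inner: Arc<dyn Reader>) -> Arc<dyn Reader>;
}

/// Fake storage that counts how often it is actually hit.
pub struct CountingStorage {
    pub hits: AtomicUsize,
}

impl Reader for CountingStorage {
    fn read(&self, path: &str) -> Vec<u8> {
        self.hits.fetch_add(1, Ordering::SeqCst);
        path.as_bytes().to_vec() // pretend the file content is its path
    }
}

/// User-supplied cache layer; the iceberg crate would not need to know it.
pub struct MemoryCacheLayer;

struct CachedReader {
    inner: Arc<dyn Reader>,
    cache: Mutex<HashMap<String, Vec<u8>>>,
}

impl Reader for CachedReader {
    fn read(&self, path: &str) -> Vec<u8> {
        let mut cache = self.cache.lock().unwrap();
        if let Some(bytes) = cache.get(path) {
            return bytes.clone();
        }
        // Miss: fall through to the wrapped reader and remember the result.
        let bytes = self.inner.read(path);
        cache.insert(path.to_string(), bytes.clone());
        bytes
    }
}

impl ReadLayer for MemoryCacheLayer {
    fn wrap(&self, inner: Arc<dyn Reader>) -> Arc<dyn Reader> {
        Arc::new(CachedReader { inner, cache: Mutex::new(HashMap::new()) })
    }
}

/// Stand-in for FileIO with an injectable layer stack.
pub struct FileIoSketch {
    reader: Arc<dyn Reader>,
}

impl FileIoSketch {
    pub fn new(storage: Arc<dyn Reader>) -> Self {
        Self { reader: storage }
    }

    /// Wraps the current reader with a user-provided layer.
    pub fn layer(self, layer: impl ReadLayer) -> Self {
        Self { reader: layer.wrap(self.reader) }
    }

    pub fn read(&self, path: &str) -> Vec<u8> {
        self.reader.read(path)
    }
}
```

With `FileIoSketch::new(storage).layer(MemoryCacheLayer)`, two reads of the same path reach the storage only once. The cache policy lives entirely in the injected layer.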
I think adding a dynamic cache layer is one option, but not an ideal one. I think a better choice may be to put the caching layer in other components, such as the data file reader, which can exercise fine-grained control. For example, a parquet reader could choose to cache only some columns rather than the whole file.
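As an illustration of the fine-grained control mentioned above, here is a minimal sketch of a reader-side cache keyed by (file, column), where only selected columns are retained. `ColumnCache` and `read_column` are hypothetical names, the `fetch` closure stands in for the actual column chunk read, and nothing here is the parquet crate's API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical reader-side cache that keeps only selected columns.
pub struct ColumnCache {
    /// Columns the engine decided are worth caching.
    cacheable: HashSet<String>,
    /// Cached column chunks keyed by (file path, column name).
    chunks: HashMap<(String, String), Vec<u8>>,
    /// Number of times the underlying storage was actually read.
    pub fetches: usize,
}

impl ColumnCache {
    pub fn new(cacheable_columns: &[&str]) -> Self {
        Self {
            cacheable: cacheable_columns.iter().map(|c| (*c).to_string()).collect(),
            chunks: HashMap::new(),
            fetches: 0,
        }
    }

    /// Returns the column chunk, calling `fetch` only on a cache miss.
    pub fn read_column(
        &mut self,
        file: &str,
        column: &str,
        fetch: impl FnOnce() -> Vec<u8>,
    ) -> Vec<u8> {
        let key = (file.to_string(), column.to_string());
        if let Some(hit) = self.chunks.get(&key) {
            return hit.clone();
        }
        self.fetches += 1;
        let bytes = fetch();
        // Columns outside the allow-list are read through but never stored.
        if self.cacheable.contains(column) {
            self.chunks.insert(key, bytes.clone());
        }
        bytes
    }
}
```

A transparent cache inside FileIO cannot make this decision, because it sees byte ranges rather than columns.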
Yes, I believe there are several levels of caching that can occur at different layers. @sundy-li is referring to a transparent cache within FileIO (as in opendal), while @liurenjie1024 is talking about an explicit cache outside of FileIO (such as in FileReader). I think both are necessary to make iceberg-rust the best iceberg implementation.
The most important thing to me is to implement them in an extensible way, allowing engines to have full control to optimize them. I'm willing to help implement both of them.
As in https://github.com/apache/iceberg-rust/pull/1222, I’ve added ObjectCache. The next step is to introduce a BytesCache. Once that's done, we’ll finally have something to experiment with.
I think it's reasonable to have an object cache in the iceberg crate, since planning is the core functionality of this library and will be used by all kinds of users.
But this is not the case for data file read/write. Data file caching is much more complex than metadata caching, and FileIO provides too little information to the user, e.g. the user can't even distinguish whether an input file is a metadata file or a data file. As a library, I think we should keep the iceberg crate as thin as possible.
I agree with @liurenjie1024 about caching in the FileIO layer. My experience of trying this was that, by the time you are inside FileIO, the lack of context about the type of each file limits flexibility. For example, caching both the raw bytes and the parsed object for manifests and manifest lists is a waste of cache space. The ObjectCache / BytesCache approach suggested by @Xuanwo would allow us to retain the context.
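To illustrate why the file type matters, here is a small sketch of a policy that picks one cache per file kind, so manifests are never stored both as raw bytes and as parsed objects. `FileKind`, `CacheTarget`, `cache_target` and the size threshold are assumptions for illustration; this is not how ObjectCache in the PR above is implemented.

```rust
/// Hypothetical file classification available to the caller but not to FileIO.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileKind {
    ManifestList,
    Manifest,
    DataFile,
}

/// Which cache, if any, should hold the file.
#[derive(Debug, PartialEq, Eq)]
pub enum CacheTarget {
    /// Keep the parsed object only (object cache).
    ParsedObject,
    /// Keep the raw bytes only (bytes cache).
    RawBytes,
    /// Do not cache at all.
    Bypass,
}

/// Picks exactly one target per file so nothing is cached twice.
/// `max_raw_bytes` is an assumed per-entry limit for the bytes cache.
pub fn cache_target(kind: FileKind, size_bytes: u64, max_raw_bytes: u64) -> CacheTarget {
    match kind {
        FileKind::ManifestList | FileKind::Manifest => CacheTarget::ParsedObject,
        FileKind::DataFile if size_bytes <= max_raw_bytes => CacheTarget::RawBytes,
        FileKind::DataFile => CacheTarget::Bypass,
    }
}
```

The caller has to pass the file kind in, which is exactly the context that is lost once a request reaches FileIO as a plain path.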
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'