Make FileIO a Trait
Is your feature request related to a problem or challenge?
Originally proposed on https://github.com/apache/iceberg-rust/issues/172#issuecomment-2472807387 making FileIO a trait would allow for more pluggable storage access. This in turn would potentially allow better integration where people already have an existing storage setup, e.g. based on object_store, that they want to use.
Describe the solution you'd like
I would like FileIO to be changed to a trait, allowing for pluggable storage engines.
The major remaining questions from the linked issue are:
- Should implementations pass around
Arc<dyn FileIOor something else - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473145048 https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473675530 - How important is preserving compatibility with iceberg-java - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473277943 https://github.com/apache/iceberg-rust/issues/172#issuecomment-2474390321
- Can we make breaking changes to iceberg-rust - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2481267603
If the decision is that we can break compatibility with iceberg-java and are happy to use a trait object, the next question is
- How should the interface differ from ObjectStore - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473365948
My 2 cents is that designing a custom abstraction at this granularity when one already exists and is well adopted within the Rust data ecosystem, seems unnecessary.
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community
Additional Context
See also https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164
Thank you for getting this started. I’ve been thinking about this as well. I believe the FileIO trait is important for iceberg-rust, as it helps separate our abstraction from the underlying implementation.
I'm willing to work on this.
IMO it would be good to get consensus on the outstanding questions before proceeding with an implementation. I think it would be good to articulate what differences this FileIO trait would have from existing ecosystem abstractions and therefore what problems it seeks to solve. I'm not sure people are on the same page here
It seems that FileIO internally manages Storage which is currently an enum. Maybe we should revisit the design of both together?
The only other thing in FileIO is FileIOBuilder, which seems to be configuration for the storage. As I mentioned earlier, I was wondering if the Iceberg library really need to manage such configuration. It might be more flexible to simply accept a storage implementation (possibly some Arc<dyn ...>) without knowing how the storage is configured. This is the idea of "dependency injection".
FWIW, FileIO::into_builder() does not seem to be used anywhere in the project.
(I think it's fine to have helper functions to build FileIO from configuration, but the configuration does not need to be stored once the data storage is instantiated.)
If we consider FileIO and Storage together, I realized that there is actually an alternative design for the IO abstraction. (This assumes that we remove FileIOBuilder from FileIO.)
#[derive(Clone, Debug)]
pub struct FileIO {
inner: Storage,
}
#[derive(Clone, Debug)]
pub(crate) enum Storage {
/// An OpenDAL operator.
Operator(Operator),
/// An object_store implementation.
ObjectStore(Arc<dyn ObjectStore>),
}
This alternative has a few benefits. (Let me know if there is any drawback that I'm not aware of.)
- There is no change to the usage of
FileIO. Storagecan be cheaply cloned and be part ofInputFileandOutputFile.- We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store. (This is a discussion point raised by @tustvold.)
- The intent is conveyed clearly that
FileIOshould work with both OpenDAL and object_store. - There is no longer need to wrap
ObjectStoreas an OpenDAL operator. In fact, after looking deeper into this, I'm not sure if the wrapper is in general well-defined since some of the OpenDAL methods (e.g. creating directory) are not part of theObjectStorecontract.
Happy to discuss!
The major downside I can see of an enum based approach is that it forces the variants to be enumerated, which in turn limits downstream extensibility. This can be fudged over with feature flags but being able to have separate crates implementing a common interface typically ends up being easier to maintain for all involved.
We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store.
TBC it is still another abstraction, regardless of if implemented as a trait or a crate private enum. It will entail building custom parquet readers, config handling, path representation, etc... and in turn limit the ability for people to bring their own existing implementations and setups.
Thanks @tustvold. Yeah the downside of Storage enum makes a lot of sense to me. I can see that a trait would be more extensible in general for downstream crate.
The solution I proposed was more like a short-term solution. I found this would result in less code change and smaller blast radius, given the current status of how FileIO is used in the Iceberg library. Although Storage is still an enum, it supports a wide range of use cases, assuming that OpenDAL and object_store have emerged as the top standards for storage abstraction in the Rust community. This would give us a quick path for object_store integration, while we evaluate the best path forward.
Or maybe we can have Storage as a trait while FileIO stays as a struct? (Again, I can see that the indirection is less ideal. What I was looking for is a migration path with least disruption, considering there are also InputFile and OutputFile etc.)
It's worth mentioning that I've raised a somewhat related issue in the past regarding the decoupling of building & serialization. On that note - I don't think Iceberg rust needs to necessarily even provide a storage implementation - that's something I would argue users generally already have covered beforehand. If the entire set of metadata types had their building and serialization/de-serialization decoupled, users would have more control over where the Iceberg metadata they build gets written.
One thing I think that ties into that though is that it's been mentioned by some of the developers that users of Iceberg Rust should use the transaction API in order to create tables - this effectively means that (and please correct me if I'm wrong here):
- Users need to provide an Iceberg Catalog implementation
- Users write their manifests and manifest lists
- This means that the user is responsible for handling building and serde for these types (
manifest-list.avro&manifest-file.avro)
- This means that the user is responsible for handling building and serde for these types (
- Users use the catalog to generate table metadata
- This means that the catalog is responsible for building and serde of table metadata (eg
v1.metadata.json)
- This means that the catalog is responsible for building and serde of table metadata (eg
With this understanding in mind - it is not clear to me why there is a storage API at all. If users are intended to build & write their own manifest lists, then Iceberg Rust should only need to provide the infrastructure for building and serializing those types, as users should be able to decide where & how they are written. In addition to that, if the Iceberg catalogs are responsible for the building and serialization of table metadata, again, why does Iceberg Rust care about the storage aspect of the table metadata, as that's a catalog implementation's responsibility.
Hi, sorry for being late for this party, and thanks @tustvold for the summary of the discussions.
- Should implementations pass around Arc<dyn FileIO or something else - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473145048 https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473675530
While we could abstract out underlying implementation using different providers/implemtations, I still recommend wrapping it in a struct when passing around in the crate. This helps to eliminate limitations of object safety in rust.
- How important is preserving compatibility with iceberg-java - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473277943 https://github.com/apache/iceberg-rust/issues/172#issuecomment-2474390321
I believe it's important to keep the abstractions/functionality required by java api, such as InputFile/OutputFile/delete, as these are required by other components. But it doesn't have be exactly same as, and should be idiomatic for rust developers.
- Can we make breaking changes to iceberg-rust - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2481267603
I think it's fine to make breaking changes if we can't avoid.
- How should the interface differ from ObjectStore - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473365948
The reason we should keep FileIO trait rather than using ObjectStore trait directly is that we need to keep consistent as much as java implementation(reference implementation). For example, java's FileIO has following extensions recently:
https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java#L25 https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsBulkOperations.java#L21 https://github.com/apache/iceberg/blob/50d310aef17908f03f595d520cd751527483752a/api/src/main/java/org/apache/iceberg/encryption/EncryptingFileIO.java#L37 https://github.com/apache/iceberg/blob/e9364faabcc67eef6c61af2ecdf7bcf9a3fef602/api/src/main/java/org/apache/iceberg/io/SupportsRecoveryOperations.java#L27 https://github.com/apache/iceberg/blob/817dc35a924b403716d2eb899aba46f3398a5ca9/core/src/main/java/org/apache/iceberg/io/SupportsStorageCredentials.java#L27
I believe keeping a FileIO trait rather than using object_store would make evolving easier.
Despite the discussion points raised by @tustvold , I have other things to discuss:
- Should we have an unified
FileIOBuildertrait, just like what we did forCatalogBuilderin https://github.com/apache/iceberg-rust/pull/1231 - Should we only keep traits like
FileIO(and potentiallyFileIOBuilder) in core crate, and move concrete implementation to other crate? Just like what we did forCatalog?
I still recommend wrapping it in a struct when passing around in the crate
So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?
Should we have an unified FileIOBuilder trait
FWIW we are upstreaming the ObjectStoreRegistry abstraction from DF that may serve as inspiration - https://github.com/apache/arrow-rs-object-store/pull/375
Should we only keep traits like FileIO(and potentially FileIOBuilder) in core crate, and move concrete implementation to other crate? Just like what we did for Catalog
This seems sensible to me FWIW
I still recommend wrapping it in a struct when passing around in the crate
So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?
I still have concerns for using ObjectStore directly for several reasons:
- Since the goal of this refactoring is to make underlying implementation extensible and allowing user to choose provider freely, I don't think it's a good idea to bind the interface to a specific vendor. For example, we also want to allow user to use
OpenDAL. - Another important goal of this refactoring is to even allowing user to provide their own
FileIOprovider(orStorage) implementation. I took a look atObjectStoretrait, it's not aligned well withFileIOrequirement. For example, it requirescopy/listas required methods, which are not required byFileIO(this interface is optional).
Since the goal of this refactoring is to make underlying implementation extensible and allowing user to choose provider freely, I don't think it's a good idea to bind the interface to a specific vendor.
This is the goal of ObjectStore, it represents an abstraction that has been developed and iterated on by the Rust data ecosystem to allow decoupling things like DataFusion and polars from specific implementations.
Now I'm not claiming it is perfect, but it does represent a non-trivial accumulation of knowledge up to this point.
As with anything there are engineering tradeoffs, and it is possible iceberg has different requirements, and that's fine. But my hope is that by finding out what these are:
- We can ensure the proposed design actually delivers these requirements
- We can potentially take some learnings back to ObjectStore to benefit the ecosystem more broadly
- We avoid NotInventedHere syndrome and can potentially short-circuit a drawn out design process
For example, we also want to allow user to use OpenDAL.
There is object_store_opendal
For example, it requires copy/list as required methods, which are not required by FileIO(this interface is optional).
You can and people do leave methods unimplemented.
For example, it requires copy/list as required methods, which are not required by FileIO(this interface is optional).
You can and people do leave methods unimplemented.
That's the problem, we are defining a richer interface that's not used by iceberg, then it would be confusing for people who want to define their own FileIO implementation, which part should they implement, and which part they should not? Also confusing for user, they are supposed to be careful when using FileIO since some implementation may not implement some methods.
Also another concern is that, if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduces when object_store evolves.
About the design part, I'm thinking about following design considerations:
- We should define small trait rather than a large trait. For example we could define traits like following
FileIO,SupportsBulkOperations, etc. So that concrete implementations could choose which traits to implement according their capability. - We could have an erased trait object definition for traits defined in 1, say
DynFileIO, which are object safe. - For end user, we could provide a struct to wrap the
DynFileIOtrait object.
(We could discuss naming of each part)
Here is an example:
trait InputFile {
...
}
trait FileIO {
type I: InputFile;
}
trait DynFileIO {
}
struct FileIOWrapper {
inner: Arc<dyn DynFileIO>
}
That's the problem, we are defining a richer interface that's not used by iceberg, then it would be confusing for people who want to define their own FileIO implementation, which part should they implement, and which part they should not? Also confusing for user, they are supposed to be careful when using FileIO since some implementation may not implement some methods.
This seems a good point. I feel an explicit interface conveys the intent if Iceberg only uses a small set of file/object operations.
Here is an example:
I'm not sure if I understand this though, especially the difference between FileIO and DynFileIO. Also it seems InputFile etc. are trait now, and I'm worried that it may add burden to the implementer.
Following my earlier thinking around making Storage a trait, would the following be a viable option?
pub struct FileIO {
inner: Arc<dyn Storage>,
}
pub(crate) trait Storage {
}
pub struct InputFile {
inner: Arc<dyn Storage>,
path: String,
}
Just sharing some experiences from the delta world which may not immediately applicable to the question around which trait to use, but maybe be food for thought as to where things could be heading?
One thing that repeatedly comes up when talking about table formats is "Metadata is Data". The to me logical consequence of that is to treat it as such, meaning process it with the same tools that you would use processing data. To that avail delta-rs currently keeps all metadata around as arrow record batches, and delta-kernel goes even further abstracting away the specific data representation.
As such the higher level abstractions we chose are on the level of file formats. I.e. read this {parquet,json,avro,..} file into arrow with this schema. The internal logic processing the metadata either visits individual fields or applies expressions on the data to generate the plans for scans etc. I think to a certain degree this thinking is actually baked into the Iceberg spec via the metadata tables.
By default we provide an arrow (arrays and kernels) and object_store based implementation using many of the same tools used here to read data. Currently I am working on a datafusion engine for kernel, where datafusions execution plans are used to read data and datafusions' native expression for evaluation.
As a consequence virtually all resource management is under full control of the query engine which is also free to apply any more advanced optimisations (caching, etc.) as it sees fit.
All that said, I am about to start a PoC to find out how much of the query planning and eventually also maintenance that is implemented in aforementioned datafusion engine can be applied to both delta and iceberg.
One thing I am fairly certain of is that the work discussed here will be making my life much easier, and if we end up in a place where we can at least do something like ...
impl<T: ObjectStore> FileIo for T {
...
}
that would be awesome!
Once we have a consensus here, I am happy to offer my support driving this forward!
Also another concern is that, if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduces when object_store evolves.
FWIW we try very hard to avoid these now, aiming for ~2 breaking releases per year. I would eventually like to release a v1.0.0, but that is unlikely to be this year.
We should define small trait rather than a large trait. For example we could define traits like following FileIO, SupportsBulkOperations , etc. So that concrete implementations could choose which traits to implement according their capability.
At least historically this sort of trait composition hasn't worked very well with trait objects. In particular you can't write Box<dyn A + B>, only auto-traits are supported. You could get around this by defining trait AB: A + B and then Rust versions > 1.86 can upcast, but this would quickly get unmanageable as the number of traits grows.
Whilst returning Err(Error::Unimplemented) is not as nice as getting a compile error, it is the best approach we've been able to devise.
We could have an erased trait object definition for traits defined in 1, say DynFileIO, which are object safe.
My experience with ObjectStore, is that overtime this interface will likely grow. Ultimately ObjectStore started out as precisely this, a very targetted IO abstraction for InfluxDB IOx, it then got donated to arrow-rs and overtime as more people have used it the interface has grown to its current state to accommodate their requirements. I suspect such a DynFileIO would likely have to go through the same process, especially if the goal is for people to use it as more than just a shim to this crate.
then it would be confusing for people who want to define their own FileIO implementation, which part should they implement, and which part they should not?
I had been viewing FileIO purely as a mechanism to shim into whatever IO abstraction is used by the broader system they're integrating with and not really as something people would generally be implementing or interacting with themselves.
However, I think this is the key perspective difference that is making consensus hard to come by, as I think there are two possible goals being discussed here and on the other linked tickets:
- Provide an extensible IO interface inspired by iceberg-java people can use across their applications
- Provide a way to integrate their existing ObjectStore based systems with this crate
Ultimately there are number of people with a demonstrable need for the latter, whereas I believe the former is largely theoretical at this stage. Further, as both @roeap and @Sl1mb0 allude to, people want to integrate iceberg into a broader system, and so even if iceberg devised a FileIO interface, other systems will likely continue to use something more expressive.
In the interests of making some progress perhaps we might find some way to decouple these objectives?
Some suggestions from the various tickets:
- Add an ObjectStore variant to the storage enum - https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2879956729
- Add an extension variant to the storage enum - https://github.com/apache/iceberg-rust/issues/172#issuecomment-2473264890
- Make Storage a trait - https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2880758072
- Replace Storage with ObjectStore - https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2893032985
All of these would provide a way to solve the pain point that people are currently running into without requiring complex design work, and I'd be happy to help out once we achieve consensus on what it is we would like to do.
A good FileIO/Storage abstraction not only benefits object_store integration, but also makes it easier to work with OpenDAL in Iceberg.
For example, I noticed that there are feature requests such as #1360. If the FileIO abstraction is extensible, such integrations probably do not need to happen within the Iceberg repo. The user should be able to use whatever storage systems supported by OpenDAL.
I'm not sure if I understand this though, especially the difference between FileIO and DynFileIO. Also it seems InputFile etc. are trait now, and I'm worried that it may add burden to the implementer.
It will not. Vendors just need to implent FileIO trait, which ensures strong type check. And we can auto implement DynFileIO for any type that implements FileIO, like following:
impl<T: FileIO> DynFileIO for T {
}
Make Storage a trait - https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2880758072
I would vote for this one, with this approach we have two benefits:
- Allowing user to provide its own implementation
- We could move concrete implementations out of core crate.
Sorry for I didn't join the party earlier.
I agree with the idea of making Storage a trait and turning FileIO into a wrapper that can be used throughout the entire project.
The Storage trait can be dedicated solely to the iceberg use case, while also enabling users to implement their own custom IO solutions. I’m aware that some vendors prefer to use their own high-performance IO implementations, optimized with io-uring and RDMA, which are not currently supported by either object_store or opendal. Introducing a dedicated storage trait would make this possible.
This trait also prevents iceberg-rust from depending directly on any implementations, allowing it to evolve according to our own needs without being constrained by upstream dependencies.
In addition to Storage itself, we will also need something like a StorageBuilder to accept storage options passed from the REST Catalog or Pyiceberg. It would also be useful to have a StorageRegistry that holds different instances of s3://bucket/path/to/file to avoid rebuilding them every time.
Like what we do for Catalog, maybe we can add Storage trait and FileIO struct in the core and have something like iceberg-storage-opendal and iceberg-storage-object-store?
Like what we do for
Catalog, maybe we can addStoragetrait andFileIOstruct in the core and have something likeiceberg-storage-opendalandiceberg-storage-object-store?
This sounds like a great idea! I guess StorageBuilder and various storage-* features could be part of iceberg-storage-opendal if the implementation uses OpenDAL? It seems that PyIceberg would be using StorageBuilder while other downstream Rust projects can choose whatever implementation they want and depend only on the core crate.
Reading through this it seems like the consensus is to try to make Storage a trait with a builder on it? @Xuanwo are you actively working on this?
Hi all, I'm currently working on a draft for this and will post it for comments/feedback soon!
Hi folks, I've drafted this doc to summarize what have been discussed here plus some code snippets to showcase what the new trait would look like: https://docs.google.com/document/d/1-CEvRvb52vPTDLnzwJRBx5KLpej7oSlTu_rg0qKEGZ8/edit?tab=t.0#heading=h.a0d7lqurg3dq
Please feel free to share your inputs on the design!