discuss: Split iceberg-rust high level API into a mini engine instead
Hi, I'm starting this thread to discuss whether it's a good idea to split iceberg-rust's high-level API into a mini engine instead.
In iceberg-rust, we seem to have two different levels of APIs. One is the low-level API, which exposes TableMetadata and Schema directly; the other is the high-level API, which exposes features like Scan. We chose this design to make it easier for our users to get started.
However, after using iceberg-rust in a production environment, I found that this design is not ideal for an external query engine that wants to optimize performance. We either do too much on the iceberg-rust side, such as hard-wiring an in-memory cache into our code, or we don't expose enough of the necessary APIs for query engines, like ArrowReader, which prevents us from using or tuning the reader from arrow-rs directly.
I'm wondering if it would be a good idea to split the current high-level APIs into a mini engine. This way, users who want a quick start or need to run workloads on a single machine can use the mini engine instead.
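To make the proposed layering concrete, here is a minimal self-contained sketch. None of these types or function names are real iceberg-rust APIs; they only illustrate the idea that the mini engine would be a thin convenience layer built on the same engine-independent planning API that an external engine like DataFusion would consume:

```rust
// Hypothetical sketch of the proposed layering; names are illustrative only.

/// Low-level core: engine-independent metadata and scan planning.
mod core {
    pub struct TableMetadata {
        pub location: String,
    }

    pub struct FileScanTask {
        pub data_file: String,
    }

    pub struct Table {
        pub metadata: TableMetadata,
    }

    impl Table {
        /// Plan scan tasks; an external engine (e.g. DataFusion)
        /// decides how to actually execute them.
        pub fn plan_files(&self) -> Vec<FileScanTask> {
            vec![FileScanTask {
                data_file: format!("{}/data/f1.parquet", self.metadata.location),
            }]
        }
    }
}

/// Mini engine: convenience layer for single-machine workloads,
/// built only on the core planning API.
mod mini {
    use super::core::Table;

    /// Execute a scan locally; returns file paths as a stand-in
    /// for actual record batches.
    pub fn scan(table: &Table) -> Vec<String> {
        table.plan_files().into_iter().map(|t| t.data_file).collect()
    }
}

fn main() {
    let table = core::Table {
        metadata: core::TableMetadata {
            location: "s3://bucket/tbl".into(),
        },
    };
    // The mini engine consumes the same planning API that DataFusion would.
    let files = mini::scan(&table);
    println!("{:?}", files);
}
```

The point of the sketch is the dependency direction: `mini` depends only on `core`, never the other way around, so an external engine can bypass `mini` entirely.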
This split would also greatly help us design the API: if an API is used by both mini and the DataFusion integration, it probably belongs in our core; if it is used only by mini, it should stay within mini.
What do you think?
Cc @liurenjie1024, @sdd, and @Fokko for their input. I especially want @sdd's feedback, as he put a lot of effort into this part.
Also @a-agmon who is working on datafusion.
Thanks @Xuanwo for raising this. I think, in general, it's a good idea to have a standalone module for simple data processing, similar to what we already have in Java, e.g. the iceberg-data module.
That is to say, we would have the following crates:

```
        iceberg
       /       \
iceberg-data   iceberg-datafusion
```
The iceberg crate is similar to the iceberg-core + iceberg-api modules in Java: it contains APIs for manipulating metadata and is supposed to be compute-engine independent. It is also used for planning scan tasks for compute engines.
The iceberg-data crate contains the implementation necessary for executing table scans on a local machine, and doesn't depend on any execution engine. We may add a feature for appending data to tables, but row-level modifications seem somewhat challenging at this time.
The iceberg-datafusion crate contains the integration with DataFusion, which allows users to execute SQL against Iceberg tables, powered, of course, by DataFusion.
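As a rough illustration, the crate split above could be organized as a Cargo workspace. This is only a sketch using the crate names from this thread; the `crates/` paths are hypothetical:

```toml
# Hypothetical workspace layout for the proposed split.
[workspace]
members = [
    "crates/iceberg",            # core: metadata + scan planning, engine-independent
    "crates/iceberg-data",       # mini engine: local scan execution, maybe appends later
    "crates/iceberg-datafusion", # DataFusion integration: SQL over Iceberg
]
```

Both `iceberg-data` and `iceberg-datafusion` would depend on `iceberg`, but not on each other.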
Thanks, @Xuanwo.
This makes a lot of sense.
While I see the value in separating core structures and functionality from the query engine implementation, I'm wondering what you see as the content of mini (besides scan(), for example). It seems very thin to me, which makes me wonder whether there is an advantage in creating something like mini, rather than integration crates (DataFusion, Polars, etc.) that simply show how to use iceberg-rust with those engines.
Hi @Xuanwo. I find myself agreeing in part with @a-agmon: conceptually, I agree that re-architecting into an iceberg-core and an iceberg-engine-lite provides the opportunity to structure the API along seams that would let us offer better integration hooks to downstream engines, while the engine-lite itself would be quite small. However, I don't see engine-lite being small or lightweight as necessarily a problem. I could see Scan and some of the high-level parts of ArrowReader being subsumed into this mini engine, and even if that were all it contained, I think it would provide value and result in a cleaner, more flexible architecture.
I also agree that the current cache implementation is not flexible enough. I'm hoping to touch on this at the Iceberg Summit next month: for my own production service, I extend the cache to hold more types of content than we currently do, in order to achieve very low latencies for small queries. I'd love a design that made this more pluggable and configurable, so that users like me could make heavier use of the cache, while users relying on external query engines could cache less within iceberg-rust itself if they need to.
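A pluggable cache could be expressed as a trait that callers swap implementations of. This is a minimal sketch of the idea, not a real iceberg-rust interface; the trait name, string-keyed API, and both implementations are assumptions for illustration:

```rust
use std::collections::HashMap;

/// Hypothetical pluggable cache interface (not a real iceberg-rust API).
/// Heavy users implement an aggressive cache; external engines that manage
/// their own caching can plug in a no-op.
trait MetadataCache {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, value: String);
}

/// Simple in-memory implementation for single-process use.
struct InMemoryCache {
    entries: HashMap<String, String>,
}

impl MetadataCache for InMemoryCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: String) {
        self.entries.insert(key.to_string(), value);
    }
}

/// No-op cache for engines that cache outside iceberg-rust.
struct NoCache;

impl MetadataCache for NoCache {
    fn get(&self, _key: &str) -> Option<String> {
        None
    }
    fn put(&mut self, _key: &str, _value: String) {}
}

fn main() {
    let mut cache: Box<dyn MetadataCache> = Box::new(InMemoryCache {
        entries: HashMap::new(),
    });
    cache.put("snapshot-1", "manifest-list.avro".into());
    assert_eq!(cache.get("snapshot-1"), Some("manifest-list.avro".into()));

    // An engine that prefers its own cache plugs in the no-op.
    let noop = NoCache;
    assert_eq!(noop.get("snapshot-1"), None);
    println!("cache sketch ok");
}
```

With something like this, the amount of caching done inside iceberg-rust becomes a per-deployment choice rather than a hard-coded behavior.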
Thank you all for the discussion. It seems we have a consensus that it's beneficial to have such an engine. I will start working on this.