Add view support to the REST Catalog
Feature Request / Improvement
- [ ] List Views - #817
- [ ] Drop View - #820
- [ ] View Exists
- [ ] Rename View
- [ ] Replace View
- [ ] Load View
- [ ] Create View
Reference: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
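For orientation, the checklist above can be mapped onto the view routes in the referenced OpenAPI file. This is only a sketch: `VIEW_ROUTES` and `view_route` are hypothetical names, not part of PyIceberg, and the paths should be checked against the current spec before anyone relies on them. It also ignores details such as encoding multi-level namespaces.

```python
# Hypothetical lookup table (not a PyIceberg API). Methods and paths are
# taken from the view section of the REST catalog OpenAPI spec as I read it.
VIEW_ROUTES = {
    "list_views": ("GET", "/v1/{prefix}/namespaces/{namespace}/views"),
    "create_view": ("POST", "/v1/{prefix}/namespaces/{namespace}/views"),
    "load_view": ("GET", "/v1/{prefix}/namespaces/{namespace}/views/{view}"),
    "replace_view": ("POST", "/v1/{prefix}/namespaces/{namespace}/views/{view}"),
    "drop_view": ("DELETE", "/v1/{prefix}/namespaces/{namespace}/views/{view}"),
    "view_exists": ("HEAD", "/v1/{prefix}/namespaces/{namespace}/views/{view}"),
    "rename_view": ("POST", "/v1/{prefix}/views/rename"),
}


def view_route(operation: str, **params: str) -> tuple[str, str]:
    """Return (HTTP method, path) for a view operation.

    Raises KeyError for an unknown operation or a missing path parameter.
    """
    method, template = VIEW_ROUTES[operation]
    return method, template.format(**params)


# Example values below are made up for illustration.
print(view_route("view_exists", prefix="prod", namespace="sales", view="daily_totals"))
```

Note that rename is the only operation that is not scoped under a namespace path; source and destination identifiers travel in the request body instead.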
Thank you for raising this @ndrluis 💯 I will add this to the 0.8.0 milestone for now.
Would love to take a first stab at this @kevinjqliu, could you assign this to me? edit: here's a PR for view_exists: #1242. Thanks!
I am really curious about how Load View should work, given that currently only SQL representations of views are supported and I don't think we have an in-process SQL engine that can convert SQL into an Iceberg scan plan (yet/at all?).
@shiv-io did you already have some thoughts there?
Following what @danielcweeks said in this email, I believe we could discuss and experiment with SQLGlot to create support for other dialects. However, to support loading views, we likely need to rely on a query engine. I'm not sure if there is a query engine in the Python ecosystem that would make sense to support, but I feel that we could use Apache DataFusion through the iceberg-rust implementation or the Python bindings.
That's an interesting question @corleyma. The way I see it, PyIceberg is a language library that tries to remain open to any Python-based query engine that wants to make use of its functions to process Iceberg tables. So I think the first step in introducing view support in PyIceberg would be for us to fetch the view representations from the REST Catalog endpoint and serve them to any query engines that want to integrate with it (like Daft).
I agree with @ndrluis though, that it would be cool to leverage projects like DataFusion to improve the way we load, slice and dice the tables in PyIceberg.
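To make the "fetch and serve the representations" step concrete, here is a rough sketch of pulling the SQL representations for the current version out of a load-view response. The field names follow the Iceberg view spec as I understand it; `SAMPLE_RESPONSE` is a trimmed, made-up payload and `current_representations` is a hypothetical helper, not an existing PyIceberg function.

```python
# Trimmed, invented example of what a load-view response body might contain.
SAMPLE_RESPONSE = {
    "metadata-location": "s3://bucket/views/daily_totals/metadata/00001.json",
    "metadata": {
        "format-version": 1,
        "current-version-id": 2,
        "versions": [
            {
                "version-id": 1,
                "representations": [
                    {"type": "sql", "dialect": "spark", "sql": "SELECT 1"},
                ],
            },
            {
                "version-id": 2,
                "representations": [
                    {"type": "sql", "dialect": "spark", "sql": "SELECT day, sum(amount) FROM sales GROUP BY day"},
                    {"type": "sql", "dialect": "trino", "sql": "SELECT day, sum(amount) FROM sales GROUP BY day"},
                ],
            },
        ],
    },
}


def current_representations(load_view_result: dict) -> list[dict]:
    """Return the representations of the version named by current-version-id."""
    metadata = load_view_result["metadata"]
    current_id = metadata["current-version-id"]
    for version in metadata["versions"]:
        if version["version-id"] == current_id:
            return version["representations"]
    raise ValueError(f"current-version-id {current_id} not found in versions")


# An engine could then pick the dialect it understands.
for rep in current_representations(SAMPLE_RESPONSE):
    print(rep["dialect"], "->", rep["sql"])
```

An engine such as Daft would then only need to choose a representation whose dialect it can parse; nothing here requires PyIceberg to execute the SQL itself.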
I agree with @sungwy that the primary goal of PyIceberg should be to make it possible for query engines to interface with Iceberg tables and views.
Nonetheless, it would be ideal to have some out-of-the-box way to get a scan of a view (something PyArrow Dataset-like would be best, but returning a Table/RecordBatchReader like the current table scan functionality is a fine endpoint). This provides an easy path for integrating with other things (like Polars) that currently support PyIceberg tables, and it would benefit use of PyIceberg for more operational concerns, e.g. being able to easily preview view contents.
I think DataFusion (either via Python bindings or via iceberg-rust) would be a great way to accomplish this goal. Since (I think?) PyIceberg is much further along in implementing the Iceberg SDK than iceberg-rust, it would be interesting if it were possible for PyIceberg to use DataFusion directly, but I suspect you need some custom Rust code no matter what?
I'm fairly new to the Iceberg ecosystem -- thanks for the insightful discussion, looks like I have some reading to do before I can weigh in.
load_view aside though, I'd love to work on the other view features if contributions towards this issue are being accepted.
@shiv-io It should still be possible to do load_view without supporting any scanning functionality yet, and like @sungwy says, that is likely a necessary precursor for other query engines anyway.
Look at how load_table works today: we return a Table model with all the metadata about the table, and this model exposes functionality for data scans, etc. So load_view would start with returning a model with all the metadata about the view (as specified in the spec), and then we can look at trying to add some DataFusion-based scan functionality in subsequent iterations.
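As a rough illustration of that metadata-only first iteration, a view model could look something like the sketch below. All class and method names here (`SQLRepresentation`, `ViewVersion`, `View`, `sql_for`, `describe`) are hypothetical and only model a small subset of the view metadata; this is not how PyIceberg implements it. The point is that the model carries no scan method at all.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SQLRepresentation:
    sql: str
    dialect: str


@dataclass(frozen=True)
class ViewVersion:
    version_id: int
    representations: tuple[SQLRepresentation, ...]


@dataclass(frozen=True)
class View:
    """Hypothetical metadata-only view model: no scan functionality."""

    identifier: tuple[str, ...]
    current_version_id: int
    versions: tuple[ViewVersion, ...]

    def current_version(self) -> ViewVersion:
        for version in self.versions:
            if version.version_id == self.current_version_id:
                return version
        raise ValueError(f"version {self.current_version_id} not found")

    def sql_for(self, dialect: str) -> Optional[str]:
        """Return the SQL text for a dialect (case-insensitive), or None."""
        for rep in self.current_version().representations:
            if rep.dialect.lower() == dialect.lower():
                return rep.sql
        return None

    def describe(self) -> str:
        """Human-readable summary of the current view definition."""
        lines = [f"view {'.'.join(self.identifier)} (version {self.current_version_id})"]
        for rep in self.current_version().representations:
            lines.append(f"  [{rep.dialect}] {rep.sql}")
        return "\n".join(lines)


# Made-up example data for illustration.
example_view = View(
    identifier=("sales", "daily_totals"),
    current_version_id=1,
    versions=(ViewVersion(1, (SQLRepresentation("SELECT 1", "spark"),)),),
)
print(example_view.describe())
```

A scan method backed by an engine such as DataFusion could be added to a model like this later without changing how the metadata is loaded.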
+1, I think it's a good idea to separate accessing Iceberg views from using them. The ability to read an Iceberg view is great for general view operations. Even printing out the view definition would be a great feature to have.
Connecting the view with an external engine can be a separate story.
It looks like view_exists has been implemented.