Remove support for catalog_name in table identifier string
Currently, we optionally support catalog names being specified in the identifier string. In other words, this means that we currently support identifier names that look like catalog_name.table_namespace.table_name or table_namespace.table_name being passed into catalog.load_table() method.
This PR proposes to remove this optional support, and change the Table class's identifier to be (table_namespace, table_name) instead of (catalog_name, table_namespace, table_name). The reasons are as follows:
- The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec
Tablehas the attributecatalog- a Table is always instantiated with a Catalog instance. Hence the catalog name is accessible as:tbl.catalog.name, and I feel that prepending the catalog name to the identifier is redundant. If there is value in keeping a long identifier including the catalog name, it may be better to leave theidentifierfield as namespace + name pair, and update thenameattribute to beself.catalog.name+self.identifierinstead.identifier_to_tuple_without_catalogis called in allcatalogimplementations, and is currently crucial in allowingTableto use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical name that is repeats the catalog name. For example, namespace:default.lowerand catalog_name:default. When we passdefault.lower.table_nameto refer to the tabletable_namein namespacedefault.loweron the catalogdefault, we will run into an unexpected exception as it will remove the first hierarchy of the namespace, and look forlower.table_name.- Fixes: https://github.com/apache/iceberg-python/issues/742
Thankfully, we have not explicitly documented that we support catalog_name in the identifier string in our API documentation, and other engine partners do not seem to be relying on this feature as well (checking one example of Daft)
It would be great to get this change in for 0.7.0 release as certain rest catalog implementations are having issues with the current approach: https://github.com/hansetag/iceberg-catalog/issues/18
However this is backward incompatible change that takes away an optional feature, so I would like to put this PR up as a first step in facilitating the discussion.
There is some more work to be done on this PR, including:
- deprecating
identifier_to_tuple_without_catalogpublic method on the catalog - removing its usage across the board, like in drop_table, etc.
I'd like to seek some feedback from the community before doing the work 🙂
Hey @sungwy thanks for raising this. Some historical context was added from the beginning to copy some of the concepts of how it works in Java/Spark, but to be frank, I don't think it works in PyIceberg since it (not yet) is SQL-based like Spark/Trino/etc:
INSERT INTO dev.schema.table SELECT * FROM prod.schema.table
The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec
The name is only prepended in PyIceberg and should not be sent to the REST api.
I would be open to removing this, but I'm concerned that it might break existing users. Does anyone have any ideas on how to mitigate this? We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.
Thanks for raising the PR! I am also +1 on removing this. This seems unnecessary and will inevitably lead to the issue as @sungwy mentioned in the third point:
identifier_to_tuple_without_catalog is called in all catalog implementations, and is currently crucial in allowing Table to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical...
if we want to support identifiers with/without catalog's name.
However, removing this means we will break the compatibility.
We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.
+1 A flag might help, but I also worry that it may still be disruptive if users are unaware of the flag and the decision to remove the feature.
How about sending an email to the dev-list for further discussion and feedback on this change? This could also help us understand the potential impact of this removal on users.
That said, I am inclined to postpone the actual removal to version 0.8.0. There are many updates and excellent new features in version 0.7.0, and I do not want this change to discourage people from upgrading.
I understand that it would be beneficial for version 0.7.0 to address https://github.com/apache/iceberg-python/pull/964, so I have also attempted #964 without removing the catalog name from the identifier. I hope this can further mitigate the issue and unblock version 0.7.0.
Would love to hear your thoughts on this!
Hi @HonahX and @Fokko thank you for the detailed reviews! Given those concerns, I think it would make sense to:
- Introduce the fix @HonahX proposed to address #742 without introducing a backward incompatible change
- Propose this change to the dev list after this release (0.7.0)
In the long run, I am in favor of this change as I think it makes logical sense and is less error prone. And as @Fokko mentioned above, the current approach is already messy, and removing this optional support would clean up the intersection of our catalog / table.
So I love the idea of aiming tactical mitigation in the short term, and advocating for a strategic fix soon after
+1 to remove the catalog name as part of the table identifier. #964 is a good workaround to get into the 0.7.0 release.
This PR is great as a long-term plan to deprecate the catalog name entirely.
I opened up the mail list discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2
Unfortunately, there hasn't been a lot of community engagement on the mail list thread. One user responded on the slack channel to express support for the change.
Based on the feedback on this PR, I've opened up https://github.com/apache/iceberg-python/pull/994 to mark this usage pattern with a deprecation warning in 0.8.0. That'll give some time for users to test and prepare for the feature to be removed in 0.9.0.
So #994 will be included in the 0.8.0 release, which will still allow catalog name but with a deprecation warning. And this PR will be included in the 0.9.0 release and will fully remove the catalog name
While working on https://github.com/apache/iceberg-python/pull/1112, I fully support this.
Now that the deprecation is out for 0.8.0 - I'll get working on this large PR again
Hi @Fokko @kevinjqliu - could I ask for your review on this PR? I think we'll be able to introduce this in 0.9.0 now that the deprecation notice has been sent out :)
And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅
@Fokko and @kevinjqliu - Thank you both! I'll take a look at fixing the commit history later today
- identifier
Thanks for the sharp eye @kevinjqliu - I've removed these as well in the newest commit
And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅
Yeah I've tried rebasing it a few times, and didn't fare well since its been a long time since this PR had been up. Had to break out from current main and force push instead to get the commit history issue resolved 🚀
@kevinjqliu and @Fokko - thank you for your reviews!