iceberg-python Remove support for catalog_name in table identifier string

Currently, we optionally support catalog names being specified in the identifier string. In other words, this means that we currently support identifier names that look like catalog_name.table_namespace.table_name or table_namespace.table_name being passed into catalog.load_table() method.

This PR proposes to remove this optional support, and change the Table class's identifier to be (table_namespace, table_name) instead of (catalog_name, table_namespace, table_name). The reasons are as follows:

The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec
Table has the attribute catalog - a Table is always instantiated with a Catalog instance. Hence the catalog name is accessible as: tbl.catalog.name, and I feel that prepending the catalog name to the identifier is redundant. If there is value in keeping a long identifier including the catalog name, it may be better to leave the identifier field as namespace + name pair, and update the name attribute to be self.catalog.name + self.identifier instead.
identifier_to_tuple_without_catalog is called in all catalog implementations, and is currently crucial in allowing Table to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical name that is repeats the catalog name. For example, namespace: default.lower and catalog_name: default. When we pass default.lower.table_name to refer to the table table_name in namespace default.lower on the catalog default, we will run into an unexpected exception as it will remove the first hierarchy of the namespace, and look for lower.table_name.
Fixes: https://github.com/apache/iceberg-python/issues/742

Thankfully, we have not explicitly documented that we support catalog_name in the identifier string in our API documentation, and other engine partners do not seem to be relying on this feature as well (checking one example of Daft)

It would be great to get this change in for 0.7.0 release as certain rest catalog implementations are having issues with the current approach: https://github.com/hansetag/iceberg-catalog/issues/18

However this is backward incompatible change that takes away an optional feature, so I would like to put this PR up as a first step in facilitating the discussion.

Jul 25 '24 01:07 sungwy

There is some more work to be done on this PR, including:

deprecating identifier_to_tuple_without_catalog public method on the catalog
removing its usage across the board, like in drop_table, etc.

I'd like to seek some feedback from the community before doing the work 🙂

Jul 25 '24 02:07 sungwy

Hey @sungwy thanks for raising this. Some historical context was added from the beginning to copy some of the concepts of how it works in Java/Spark, but to be frank, I don't think it works in PyIceberg since it (not yet) is SQL-based like Spark/Trino/etc:

INSERT INTO dev.schema.table SELECT * FROM prod.schema.table

The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec

The name is only prepended in PyIceberg and should not be sent to the REST api.

I would be open to removing this, but I'm concerned that it might break existing users. Does anyone have any ideas on how to mitigate this? We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.

Jul 25 '24 07:07 Fokko

Thanks for raising the PR! I am also +1 on removing this. This seems unnecessary and will inevitably lead to the issue as @sungwy mentioned in the third point:

identifier_to_tuple_without_catalog is called in all catalog implementations, and is currently crucial in allowing Table to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical...

if we want to support identifiers with/without catalog's name.

However, removing this means we will break the compatibility.

We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.

+1 A flag might help, but I also worry that it may still be disruptive if users are unaware of the flag and the decision to remove the feature.

How about sending an email to the dev-list for further discussion and feedback on this change? This could also help us understand the potential impact of this removal on users.

That said, I am inclined to postpone the actual removal to version 0.8.0. There are many updates and excellent new features in version 0.7.0, and I do not want this change to discourage people from upgrading.

I understand that it would be beneficial for version 0.7.0 to address https://github.com/apache/iceberg-python/pull/964, so I have also attempted #964 without removing the catalog name from the identifier. I hope this can further mitigate the issue and unblock version 0.7.0.

Would love to hear your thoughts on this!

Jul 25 '24 09:07 HonahX

Hi @HonahX and @Fokko thank you for the detailed reviews! Given those concerns, I think it would make sense to:

Introduce the fix @HonahX proposed to address #742 without introducing a backward incompatible change
Propose this change to the dev list after this release (0.7.0)

In the long run, I am in favor of this change as I think it makes logical sense and is less error prone. And as @Fokko mentioned above, the current approach is already messy, and removing this optional support would clean up the intersection of our catalog / table.

So I love the idea of aiming tactical mitigation in the short term, and advocating for a strategic fix soon after

Jul 25 '24 10:07 sungwy

+1 to remove the catalog name as part of the table identifier. #964 is a good workaround to get into the 0.7.0 release.

This PR is great as a long-term plan to deprecate the catalog name entirely.

Jul 26 '24 03:07 kevinjqliu

I opened up the mail list discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2

Unfortunately, there hasn't been a lot of community engagement on the mail list thread. One user responded on the slack channel to express support for the change.

Based on the feedback on this PR, I've opened up https://github.com/apache/iceberg-python/pull/994 to mark this usage pattern with a deprecation warning in 0.8.0. That'll give some time for users to test and prepare for the feature to be removed in 0.9.0.

Aug 03 '24 23:08 sungwy

So #994 will be included in the 0.8.0 release, which will still allow catalog name but with a deprecation warning. And this PR will be included in the 0.9.0 release and will fully remove the catalog name

Aug 07 '24 17:08 kevinjqliu

While working on https://github.com/apache/iceberg-python/pull/1112, I fully support this.

Aug 28 '24 21:08 Fokko

Now that the deprecation is out for 0.8.0 - I'll get working on this large PR again

Dec 07 '24 03:12 sungwy

Hi @Fokko @kevinjqliu - could I ask for your review on this PR? I think we'll be able to introduce this in 0.9.0 now that the deprecation notice has been sent out :)

Dec 10 '24 01:12 sungwy

And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅

Dec 18 '24 14:12 Fokko

@Fokko and @kevinjqliu - Thank you both! I'll take a look at fixing the commit history later today

Dec 18 '24 14:12 sungwy

identifier

Thanks for the sharp eye @kevinjqliu - I've removed these as well in the newest commit

Dec 19 '24 00:12 sungwy

And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅

Yeah I've tried rebasing it a few times, and didn't fare well since its been a long time since this PR had been up. Had to break out from current main and force push instead to get the commit history issue resolved 🚀

Dec 19 '24 01:12 sungwy

@kevinjqliu and @Fokko - thank you for your reviews!

Dec 19 '24 16:12 sungwy