iceberg-python
iceberg-python copied to clipboard
Adjust the "table_exists" behavior in the REST Catalog
Apache Iceberg version
0.7.0 (latest release)
Please describe the bug 🐞
Currently, we return True when the status code is 200 or 204, and False for all other status codes. According to the REST specification, we should return False when the catalog returns a 404 status code and raise an error for other status codes.
In the Java implementation, a try/catch block is used with load_table, and when it catches a NoSuchTableError, it currently returns False.
I believe that this behavior is suppressing errors and looks like it's a bug as reported in #1006
@kevinjqliu I noticed that you are looking into the Issue on the Polaris side and I believe that we need to adjust this on our side.
Thanks for looking into this. The behavior you described above seems like the correct one to me.
I believe that this behavior is suppressing errors and looks like it's a bug as reported in https://github.com/apache/iceberg-python/issues/1006
I don't follow this since the expected value should be True. According to the spec, the function returns True only for the 204 status code.
There's a theory in the other thread (https://github.com/polaris-catalog/polaris/issues/96#issuecomment-2272474147) that the error is due to managed Polaris not implementing the HEAD API, which would cause any HEAD request to return 404.
Reference
@ndrluis @kevinjqliu I can contribute to this. Would you say the intended behavior should be:
- 204 return True
- 404 NoSuchTableException
- others: raise a corresponding response error
Does it make sense for a method suffixed with "exists" to never return false? As I mentioned, and as it is in the Java implementation, when a 404 status is returned, it should indicate false.
However, as @kevinjqliu pointed out, we might receive a 404 when the HEAD method is not implemented. I believe we should expect the Specification to be implemented as defined, and if it's not, it's a bug on the catalog side and not on ours.
Therefore, I believe we can follow the Java implementation in the sense that it returns false when a 404 status is encountered, but we should maintain the use of the HEAD call because it is expected to exist according to the Catalog spec.
@kevinjqliu @TiansuYu @Fokko @sungwy @HonahX What are your thoughts?
Hi @TiansuYu and @ndrluis - thank you for bringing up this point, and sorry for not getting around to looking at this earlier.
Similar to what @TiansuYu suggested, I'm of the opinion that we should do the following:
- 200, 204: return True
- 404: return False
- others, raise a corresponding response error.
The reason is, because there's a numerous different factors as to whether a REST endpoint will return a non non 204/404 response, and it would be erroneous for us to return false on any other status code. If users are relying on this endpoint to return affirmatively say that the table exists or not for their use case, and if the REST catalog returns a 5xx error due to an unknown reason, having that be interpreted as False is error-prone
@ndrluis @kevinjqliu I can contribute to this. Would you say the intended behavior should be:
- 204 return True
- 404 NoSuchTableException
- others: raise a corresponding response error
It looks like @ndrluis raised this issue, and has it assigned this to himself. Maybe we could help review his code :)
Thanks everyone for bringing this up.
Long term, I think we should leverage the endpoint discovery to see if the server provides the capability of table-exists: https://lists.apache.org/thread/8h86382omdx9cmvc15m2bf361p5rz4rk
Since the table-exists is in there from the beginning, I think we should follow the spec, and assume that:
- 204: Table exists.
- 404: Table doesn't exist.
- Raise an exception for the rest, similar to other operations.
The HEAD operation does not require the REST Catalog to fetch the metadata, which will consume less resources on the server, but probably also return much faster for the client.
I think need to update the Java implementation to use the HEAD request, this will also uncover the issue on the Java side. For validation, hopefully, we can check this properly using https://github.com/apache/iceberg/pull/10908 once that is in.
I checked the Java side, and I could not find the HEAD request. For raising visibility, I created an issue there: https://github.com/apache/iceberg/issues/10993
@ndrluis can this be closed now that #1096 is merged?
still getting errors with with pyiceberg 0.8 rc1, when running this code
tbl = db+"."+"mstdatetime"
if not catalog.table_exists(tbl):
df=duckdb.sql(""" SELECT cast(unnest(generate_series(cast ('2018-04-01' as date), cast('2024-12-31' as date), interval 5 minute)) as TIMESTAMPTZ) as SETTLEMENTDATE,
strftime(SETTLEMENTDATE, '%I:%M:%S %p') as time,
cast(SETTLEMENTDATE as date ) as date,
EXTRACT(year from date) as year,
EXTRACT(month from date) as month
""").arrow()
catalog.create_table(tbl,schema=df.schema)
catalog.load_table(tbl).overwrite(df)
print('calendar created')
else:
print("table exist already")
HTTPError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyiceberg/catalog/rest.py](https://localhost:8080/#) in _create_table(self, identifier, schema, location, partition_spec, sort_order, properties, stage_create)
598 try:
--> 599 response.raise_for_status()
600 except HTTPError as exc:
11 frames
HTTPError: 409 Client Error: Conflict for url: https://xxxx-polaris.snowflakecomputing.com/polaris/api/catalog/v1/dwh/namespaces/aemo/tables
The above exception was the direct cause of the following exception:
TableAlreadyExistsError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyiceberg/catalog/rest.py](https://localhost:8080/#) in _handle_non_200_response(self, exc, error_handler)
470 )
471
--> 472 raise exception(response) from exc
473
474 def _init_sigv4(self, session: Session) -> None:
TableAlreadyExistsError: AlreadyExistsException: Table already exists: aemo.mstdatetime
I think the issue here is that catalog.table_exists(tbl) returns False when the table already exists.
@djouallah can you check what the the response/status code is for catalog.table_exists?
Sorry, how to do that ?
something like this, following the table_exists implementation
https://github.com/apache/iceberg-python/blob/b7942a85dfb74ce3736c5088995e7bd0b996d56b/pyiceberg/catalog/rest.py#L875-L888
from pyiceberg.catalog.rest import Endpoints
identifier = tbl
identifier_tuple = catalog._identifier_to_tuple_without_catalog(identifier)
response = catalog._session.head(
catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))
)
<Response [404]>
404 would make table_exists return False
https://github.com/apache/iceberg-python/blob/b7942a85dfb74ce3736c5088995e7bd0b996d56b/pyiceberg/catalog/rest.py#L889-L890
what is the url its hitting?
catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))
while you're at it, can you print out these variables
identifier
identifier_tuple
catalog._split_identifier_for_path(identifier_tuple)
This is either
- the pyiceberg client somehow sending the wrong request
- polaris server returning wrong result
catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))
https://xxxxx-polaris.snowflakecomputing.com/polaris/api/catalog/v1/dwh/namespaces/demo/tables/scada
identifier
identifier_tuple
catalog._split_identifier_for_path(identifier_tuple)
response
demo.scada
('demo', 'scada')
{'namespace': 'demo', 'table': 'scada'}
<Response [404]>