iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Adjust the "table_exists" behavior in the REST Catalog

Open ndrluis opened this issue 1 year ago • 1 comments

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

Currently, we return True when the status code is 200 or 204, and False for all other status codes. According to the REST specification, we should return False when the catalog returns a 404 status code and raise an error for other status codes.

In the Java implementation, a try/catch block is used with load_table, and when it catches a NoSuchTableError, it currently returns False.

I believe that this behavior is suppressing errors and looks like it's a bug as reported in #1006

@kevinjqliu I noticed that you are looking into the Issue on the Polaris side and I believe that we need to adjust this on our side.

ndrluis avatar Aug 07 '24 15:08 ndrluis

Thanks for looking into this. The behavior you described above seems like the correct one to me.

I believe that this behavior is suppressing errors and looks like it's a bug as reported in https://github.com/apache/iceberg-python/issues/1006

I don't follow this since the expected value should be True. According to the spec, the function returns True only for the 204 status code. There's a theory in the other thread (https://github.com/polaris-catalog/polaris/issues/96#issuecomment-2272474147) that the error is due to managed Polaris not implementing the HEAD API, which would cause any HEAD request to return 404.

Reference

kevinjqliu avatar Aug 07 '24 16:08 kevinjqliu

@ndrluis @kevinjqliu I can contribute to this. Would you say the intended behavior should be:

  • 204 return True
  • 404 NoSuchTableException
  • others: raise a corresponding response error

TiansuYu avatar Aug 23 '24 07:08 TiansuYu

Does it make sense for a method suffixed with "exists" to never return false? As I mentioned, and as it is in the Java implementation, when a 404 status is returned, it should indicate false.

However, as @kevinjqliu pointed out, we might receive a 404 when the HEAD method is not implemented. I believe we should expect the Specification to be implemented as defined, and if it's not, it's a bug on the catalog side and not on ours.

Therefore, I believe we can follow the Java implementation in the sense that it returns false when a 404 status is encountered, but we should maintain the use of the HEAD call because it is expected to exist according to the Catalog spec.

@kevinjqliu @TiansuYu @Fokko @sungwy @HonahX What are your thoughts?

ndrluis avatar Aug 23 '24 12:08 ndrluis

Hi @TiansuYu and @ndrluis - thank you for bringing up this point, and sorry for not getting around to looking at this earlier.

Similar to what @TiansuYu suggested, I'm of the opinion that we should do the following:

  • 200, 204: return True
  • 404: return False
  • others, raise a corresponding response error.

The reason is, because there's a numerous different factors as to whether a REST endpoint will return a non non 204/404 response, and it would be erroneous for us to return false on any other status code. If users are relying on this endpoint to return affirmatively say that the table exists or not for their use case, and if the REST catalog returns a 5xx error due to an unknown reason, having that be interpreted as False is error-prone

sungwy avatar Aug 23 '24 14:08 sungwy

@ndrluis @kevinjqliu I can contribute to this. Would you say the intended behavior should be:

  • 204 return True
  • 404 NoSuchTableException
  • others: raise a corresponding response error

It looks like @ndrluis raised this issue, and has it assigned this to himself. Maybe we could help review his code :)

sungwy avatar Aug 23 '24 14:08 sungwy

Thanks everyone for bringing this up.

Long term, I think we should leverage the endpoint discovery to see if the server provides the capability of table-exists: https://lists.apache.org/thread/8h86382omdx9cmvc15m2bf361p5rz4rk

Since the table-exists is in there from the beginning, I think we should follow the spec, and assume that:

  • 204: Table exists.
  • 404: Table doesn't exist.
  • Raise an exception for the rest, similar to other operations.

The HEAD operation does not require the REST Catalog to fetch the metadata, which will consume less resources on the server, but probably also return much faster for the client.

I think need to update the Java implementation to use the HEAD request, this will also uncover the issue on the Java side. For validation, hopefully, we can check this properly using https://github.com/apache/iceberg/pull/10908 once that is in.

Fokko avatar Aug 23 '24 15:08 Fokko

I checked the Java side, and I could not find the HEAD request. For raising visibility, I created an issue there: https://github.com/apache/iceberg/issues/10993

Fokko avatar Aug 23 '24 15:08 Fokko

@ndrluis can this be closed now that #1096 is merged?

kevinjqliu avatar Sep 01 '24 14:09 kevinjqliu

still getting errors with with pyiceberg 0.8 rc1, when running this code

tbl = db+"."+"mstdatetime"
if not catalog.table_exists(tbl):
  df=duckdb.sql(""" SELECT cast(unnest(generate_series(cast ('2018-04-01' as date), cast('2024-12-31' as date), interval 5 minute)) as TIMESTAMPTZ) as SETTLEMENTDATE,
           strftime(SETTLEMENTDATE, '%I:%M:%S %p') as time,
           cast(SETTLEMENTDATE as date ) as date,
           EXTRACT(year from date) as year,
           EXTRACT(month from date) as month
           """).arrow()


  catalog.create_table(tbl,schema=df.schema)
  catalog.load_table(tbl).overwrite(df)
  print('calendar created')
else:
    print("table exist already")
HTTPError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyiceberg/catalog/rest.py](https://localhost:8080/#) in _create_table(self, identifier, schema, location, partition_spec, sort_order, properties, stage_create)
    598         try:
--> 599             response.raise_for_status()
    600         except HTTPError as exc:

11 frames
HTTPError: 409 Client Error: Conflict for url: https://xxxx-polaris.snowflakecomputing.com/polaris/api/catalog/v1/dwh/namespaces/aemo/tables

The above exception was the direct cause of the following exception:

TableAlreadyExistsError                   Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyiceberg/catalog/rest.py](https://localhost:8080/#) in _handle_non_200_response(self, exc, error_handler)
    470             )
    471 
--> 472         raise exception(response) from exc
    473 
    474     def _init_sigv4(self, session: Session) -> None:

TableAlreadyExistsError: AlreadyExistsException: Table already exists: aemo.mstdatetime

djouallah avatar Nov 11 '24 04:11 djouallah

I think the issue here is that catalog.table_exists(tbl) returns False when the table already exists.

@djouallah can you check what the the response/status code is for catalog.table_exists?

kevinjqliu avatar Nov 11 '24 17:11 kevinjqliu

Sorry, how to do that ?

djouallah avatar Nov 12 '24 01:11 djouallah

something like this, following the table_exists implementation https://github.com/apache/iceberg-python/blob/b7942a85dfb74ce3736c5088995e7bd0b996d56b/pyiceberg/catalog/rest.py#L875-L888

from pyiceberg.catalog.rest import Endpoints

identifier = tbl
identifier_tuple = catalog._identifier_to_tuple_without_catalog(identifier)
response = catalog._session.head(
    catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))
)

kevinjqliu avatar Nov 12 '24 05:11 kevinjqliu

<Response [404]>

djouallah avatar Nov 12 '24 06:11 djouallah

404 would make table_exists return False https://github.com/apache/iceberg-python/blob/b7942a85dfb74ce3736c5088995e7bd0b996d56b/pyiceberg/catalog/rest.py#L889-L890

what is the url its hitting?

catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))

while you're at it, can you print out these variables

identifier
identifier_tuple
catalog._split_identifier_for_path(identifier_tuple)

This is either

  • the pyiceberg client somehow sending the wrong request
  • polaris server returning wrong result

kevinjqliu avatar Nov 12 '24 16:11 kevinjqliu

catalog.url(Endpoints.load_table, prefixed=True, **catalog._split_identifier_for_path(identifier_tuple))

https://xxxxx-polaris.snowflakecomputing.com/polaris/api/catalog/v1/dwh/namespaces/demo/tables/scada


identifier
identifier_tuple
catalog._split_identifier_for_path(identifier_tuple)
response

demo.scada
('demo', 'scada')
{'namespace': 'demo', 'table': 'scada'}
<Response [404]>

djouallah avatar Nov 12 '24 23:11 djouallah