
Fix spurious warnings and bogus index when reflecting Iceberg tables

metadaddy opened this pull request 1 year ago • 9 comments

Description

Previously, during reflection, TrinoDialect.get_indexes() called _get_partitions(), which executed

SELECT * FROM {schema}."{table_name}$partitions"

and assumed that, if this query executed successfully, the table in hand was a Hive table. The issue (#518) was that this same query also succeeds for Iceberg tables, resulting in a spurious index (containing no columns) being added to the table, and a series of warnings from SQLAlchemy about index keys not being located in the column names for the table.

This PR adds a check to TrinoDialect.get_indexes() to ensure that the catalog in hand is a Hive catalog before calling _get_partitions().
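The shape of that guard can be sketched as follows. This is a minimal illustration, not the actual trino-python-client code: `is_hive` and `get_partitions` stand in for the real connector check and the `_get_partitions()` call.

```python
def get_indexes(table_name, schema, is_hive, get_partitions):
    # Iceberg tables also answer the `"{table}$partitions"` query, so a
    # successful query alone cannot identify Hive; check the connector first.
    if not is_hive:
        return []
    partition_cols = get_partitions(schema, table_name)
    if not partition_cols:
        return []
    # Hive exposes its partition keys as a single synthetic "partition" index.
    return [{"name": "partition", "column_names": partition_cols, "unique": False}]
```

With the guard in place, Iceberg tables simply get no index, which is what the reflection output below shows.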

I looked at creating a test method, but I couldn't see how to do so. Instead, I bench-tested the fix against Hive and Iceberg catalogs with the test app from #518. The output for an Iceberg table is:

Calling reflect()

Listing tables:
Table name: drivestats

For a Hive table it is:

Calling reflect()

Listing tables:
Table name: drivestats
	Index name: partition
		Column name: drivestats.year
		Column name: drivestats.month

Non-technical explanation

This PR fixes spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

* Fix spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables. ({issue}`518`)

Fixes https://github.com/trinodb/trino-python-client/issues/518

metadaddy avatar Jan 15 '25 23:01 metadaddy

I'll see if I can figure out a way to make it work on 351.

metadaddy avatar Jan 15 '25 23:01 metadaddy

The 351 check passes now. This is the query that the code uses to check whether the catalog's connector is Hive. If it returns 1, then it's Hive.

SELECT
    COUNT(*)
FROM "system"."metadata"."table_properties"
WHERE "catalog_name" = :catalog_name
  AND "property_name" = 'bucketing_version'

If there's a better way to do this that passes the 351 test, I'd be happy to change the code!
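In Python terms, the SQL check above reduces to the following sketch. It operates over in-memory rows rather than issuing the real query, and the assumption that `bucketing_version` is a Hive-only table property is exactly the point hashhar questions below.

```python
HIVE_ONLY_PROPERTY = "bucketing_version"  # assumption: this property is Hive-specific

def is_hive_catalog(table_property_rows, catalog_name):
    # Mirrors the SQL over in-memory rows simulating
    # "system"."metadata"."table_properties": count matches for the catalog
    # and the Hive-only property; a non-zero count means the connector is Hive.
    count = sum(
        1
        for row in table_property_rows
        if row["catalog_name"] == catalog_name
        and row["property_name"] == HIVE_ONLY_PROPERTY
    )
    return count >= 1
```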

metadaddy avatar Jan 16 '25 03:01 metadaddy

For reference, the earlier alternative was reading from "system"."metadata"."catalogs". I think for this case it's fine to let the warnings happen for versions which don't have system.metadata.catalogs. I don't like relying on table properties since nothing guarantees the property won't get added to Iceberg, for example.

For the test, you can make it "pass" with mark.skipif (or whatever it's called).

cc: @damian3031 @ebyhr @dungdm93 what do you think?

hashhar avatar Jan 16 '25 16:01 hashhar

@hashhar I realized I could query "system"."information_schema"."columns" to see if the connector_name column is there. So, the warning is shown on old versions and all tests pass without any additional @pytest.mark.skipif annotations.
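That version probe might look something like this sketch, over simulated rows from "system"."information_schema"."columns". The table and column names below reflect my reading of the comment (checking whether system.metadata.catalogs exposes connector_name); the real code issues a query instead.

```python
def server_supports_connector_name(information_schema_columns):
    # On newer Trino versions, system.metadata.catalogs has a connector_name
    # column; on old versions (e.g. 351) it is absent, so the dialect falls
    # back to the previous behaviour (and its warnings).
    return any(
        row["table_schema"] == "metadata"
        and row["table_name"] == "catalogs"
        and row["column_name"] == "connector_name"
        for row in information_schema_columns
    )
```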

metadaddy avatar Jan 24 '25 02:01 metadaddy

Python 3.10 should be happy now... We'll see!

metadaddy avatar Feb 20 '25 22:02 metadaddy

All tests pass, @hashhar - ready for your renewed attention 🙂

metadaddy avatar Feb 20 '25 22:02 metadaddy

Thanks @metadaddy, I'll test locally once today and then merge. Sorry for the delays and long back and forth.

hashhar avatar Feb 24 '25 06:02 hashhar

Hi @hashhar - no problem at all - as much of the delay was on my side as yours, and I think we arrived at the correct solution!

metadaddy avatar Feb 24 '25 17:02 metadaddy

Hi @hashhar - just bumping this so it doesn't get forgotten 😉

metadaddy avatar Mar 10 '25 01:03 metadaddy