tbl.append(df): schema validation of tbl & df compares column order as well as data types

Open sivaraman-ai opened this issue 1 year ago • 4 comments

Apache Iceberg version

0.6.1

Please describe the bug 🐞

While writing a dataframe to Iceberg through tbl.append(df), the table schema is validated against the df schema.

Inside append, the call _check_schema_compatible(self.schema(), other_schema=df.schema) does this schema validation.

Here the table schema and the df schema are converted to a pyarrow schema of struct type and compared, taking into account the order of the dataframe columns as well as their data types.

This results in the following error:

Traceback (most recent call last):
  File "/Users/apple/Projects/bright/brightmoney_collections_system/utils/index.py", line 172, in <module>
    dff = write_to_iceberg(
  File "/Users/apple/Projects/bright/brightmoney_collections_system/utils/index.py", line 163, in write_to_iceberg
    table.append(pyarrow_df)
  File "/Users/apple/Projects/bright/brightmoney_collections_system/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 1057, in append
    _check_schema_compatible(self.schema(), other_schema=df.schema)
  File "/Users/apple/Projects/bright/brightmoney_collections_system/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 175, in _check_schema_compatible
    raise ValueError(f"Mismatch in fields:\n{console.export_text()}")
ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field                ┃ Dataframe field            ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ✅ │ 1: a: optional timestamptz │ 1: a: optional timestamptz │
│ ✅ │ 2: b: optional timestamptz │ 2: b: optional timestamptz │
│ ✅ │ 3: x: optional string      │ 3: x: optional string      │
│ ✅ │ 4: y: optional string      │ 4: y: optional string      │
└────┴────────────────────────────┴────────────────────────────┘

Yet there is no mismatch between the fields of the table and the dataframe.

Ideally, the schema compatibility check should not consider the order in which the dataframe columns are sent?

sivaraman-ai, Aug 22 '24 11:08

Digging deeper, this condition compares the two schemas as structs, so it checks the field order as well as the data types:

if table_schema.as_struct() != task_schema.as_struct()

If the dataframe sent to append does not have its columns in the same order as the table schema, the write fails because the structs turn out to be:

table schema - struct<1: a: optional timestamptz, 2: b: optional timestamptz, 3: x: optional string, 4: y: optional string> (table columns in this order: a, b, x, y)

dataframe schema - struct<1: a: optional timestamptz, 2: b: optional timestamptz, 4: y: optional string, 3: x: optional string> (dataframe columns in this order: a, b, y, x)
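To illustrate why the two structs above compare as unequal, here is a minimal sketch in plain Python. The `Field` class is a hypothetical stand-in for a schema field, not PyIceberg's own type, and the tuples only mimic the behaviour of an order-sensitive struct comparison.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a schema field; NOT PyIceberg's NestedField.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str
    required: bool = False


# Table columns in the order a, b, x, y.
table_struct = (
    Field(1, "a", "timestamptz"),
    Field(2, "b", "timestamptz"),
    Field(3, "x", "string"),
    Field(4, "y", "string"),
)

# Same fields, but the dataframe lists y before x.
df_struct = (
    Field(1, "a", "timestamptz"),
    Field(2, "b", "timestamptz"),
    Field(4, "y", "string"),
    Field(3, "x", "string"),
)

# Positional comparison: fails even though every field has a match.
positional_equal = table_struct == df_struct

# Comparison keyed by field ID: column order no longer matters.
by_id_equal = (
    {f.field_id: f for f in table_struct} == {f.field_id: f for f in df_struct}
)

print(positional_equal, by_id_equal)  # False True
```

This would also explain the confusing error table: the report pairs fields up by ID, so every row gets a check mark, while the equality test that raised the error looked at positions.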

I think schema validation could be applied to the data types of the columns instead of their order, or the error message could be more helpful; "Mismatch in fields" doesn't make sense here.
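A rough sketch of what an order-insensitive check with a more specific message could look like. This is only an illustration under simplifying assumptions: `find_mismatches` and the `(name, type)` pair representation are hypothetical, nested types and optional columns are ignored, and this is not how PyIceberg implements its check.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified schema representation: ordered (name, type) pairs.
Columns = List[Tuple[str, str]]


def find_mismatches(table_cols: Columns, df_cols: Columns) -> List[str]:
    """Compare columns by name and type, ignoring their order."""
    table_types: Dict[str, str] = dict(table_cols)
    df_types: Dict[str, str] = dict(df_cols)
    problems: List[str] = []

    # Every table column must be present in the dataframe with the same type.
    for name, expected in table_types.items():
        if name not in df_types:
            problems.append(f"missing column '{name}' ({expected})")
        elif df_types[name] != expected:
            problems.append(
                f"column '{name}': expected {expected}, got {df_types[name]}"
            )

    # Dataframe columns unknown to the table are reported as well.
    for name in df_types:
        if name not in table_types:
            problems.append(f"unexpected column '{name}'")

    return problems


table_cols = [("a", "timestamptz"), ("b", "timestamptz"), ("x", "string"), ("y", "string")]
reordered = [("a", "timestamptz"), ("b", "timestamptz"), ("y", "string"), ("x", "string")]
wrong_type = [("a", "timestamptz"), ("b", "timestamptz"), ("x", "long"), ("y", "string")]

print(find_mismatches(table_cols, reordered))   # []
print(find_mismatches(table_cols, wrong_type))  # ["column 'x': expected string, got long"]
```

With a check of this shape, a reordered dataframe passes and a real type mismatch names the offending column.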

thanks

sivaraman-ai, Aug 22 '24 11:08

Hi @sivaraman-ai - this was fixed in 0.7.x. Could you try using a newer version of PyIceberg? https://github.com/apache/iceberg-python/pull/921

The latest release is 0.7.1

sungwy, Aug 22 '24 14:08

Hi @sungwy, thanks

will check with the latest version

sivaraman-ai, Aug 27 '24 11:08

We have improved _check_schema_compatible since 0.6.1 (see #921).

kevinjqliu, Aug 31 '24 13:08

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot], Feb 28 '25 00:02

this is resolved, thanks

sivaraman-ai, Feb 28 '25 07:02