
fix: remove unnecessary metadata action when overwriting partition

Open PeterKeDer opened this issue 1 year ago • 0 comments

Description

Fixes a spurious metadata action that is written when write_deltalake is called with mode='overwrite', a predicate, and a string partition column. The extra action is undesirable because it causes all concurrent writes to fail and be retried.

The cause is the schema != table_schema check. table_schema comes from table.input_schema(), which converts string partition columns to dictionary type, so it compares unequal to schema even when the logical schema has not changed.

We fix this by comparing the schemas with try_cast_batch instead, so the behavior becomes identical to writing with mode='append'.

To replicate (on deltalake==0.20.1):

from deltalake import DeltaTable, write_deltalake
import polars as pl

df1 = pl.DataFrame({'id': ['a', 'b'], 'val': [1,2]})

write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])

In the transaction log, the JSON files for the last two writes each contain a metadata action indicating a schema change, even though the schema is identical.
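One way to check this without opening the files by hand is to list the action types in each commit. The following is a minimal sketch, not part of the fix: the helper names actions_per_commit and commits_with_metadata are hypothetical, and it assumes the standard Delta log layout, where each commit is a newline-delimited JSON file under _delta_log and each line is an object whose single top-level key names the action (for example metaData, add, remove, commitInfo).

```python
import json
from pathlib import Path


def actions_per_commit(table_path):
    """Map each commit file name to the list of action types it contains.

    Hypothetical helper: assumes commits are newline-delimited JSON files
    directly under <table_path>/_delta_log.
    """
    log_dir = Path(table_path) / "_delta_log"
    result = {}
    for commit_file in sorted(log_dir.glob("*.json")):
        actions = []
        for line in commit_file.read_text().splitlines():
            if line.strip():
                # Each line is one action; its top-level key is the action type.
                actions.extend(json.loads(line).keys())
        result[commit_file.name] = actions
    return result


def commits_with_metadata(table_path):
    """Return the commit file names that contain a metaData action."""
    return [
        name
        for name, actions in actions_per_commit(table_path).items()
        if "metaData" in actions
    ]
```

Calling commits_with_metadata('testtable1') after the three writes above lists the commits that carry a metadata action. The first commit is expected to appear because it creates the table; the issue described here is that the second and third commits appear as well.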

PeterKeDer, Oct 05 '24 00:10