[suggestion] Write path optimization

Open kevinjqliu opened this issue 1 year ago • 0 comments

Feature Request / Improvement

Let's investigate the level of abstraction on the write path.

Currently, schema-compatibility checks, schema coercion, bin-packing, transformation, etc. happen at different levels of the stack. It would be good to review this and see which of these steps can be pushed up the stack.

For example, here's what the overwrite path looks like:

overwrite
	_dataframe_to_data_files
		write_file
			write_parquet

(copied over from https://github.com/apache/iceberg-python/pull/910#pullrequestreview-2175574772)

Another example: https://github.com/apache/iceberg-python/pull/786#discussion_r1646417180

More info

overwrite checks schema compatibility: https://github.com/apache/iceberg-python/blob/3f44dfe711e96beda6aa8622cf5b0baffa6eb0f2/pyiceberg/table/__init__.py#L541-L550

_dataframe_to_data_files bin-packs the pyarrow Table: https://github.com/apache/iceberg-python/blob/3f44dfe711e96beda6aa8622cf5b0baffa6eb0f2/pyiceberg/io/pyarrow.py#L2222-L2225

write_parquet transforms the table schema: https://github.com/apache/iceberg-python/blob/3f44dfe711e96beda6aa8622cf5b0baffa6eb0f2/pyiceberg/io/pyarrow.py#L2001-L2008 and https://github.com/apache/iceberg-python/blob/3f44dfe711e96beda6aa8622cf5b0baffa6eb0f2/pyiceberg/io/pyarrow.py#L2011-L2021
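
To make the idea concrete, here is a rough sketch of what "pushing up the stack" could look like: the top-level call validates and coerces once, and the lower layers only split and serialize. This is not the pyiceberg implementation. Schemas are modeled as plain dicts, tables as lists of row dicts, and every function name other than overwrite and write_file is hypothetical. Fixed-size chunking stands in for bin-packing, and repr stands in for the Parquet write.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-ins, not pyiceberg API: a "schema" maps column
# name -> Python type, and a "table" is a list of row dicts.
Schema = Dict[str, type]
Row = Dict[str, object]


def check_schema_compatible(table_schema: Schema, rows: List[Row]) -> None:
    """Fail fast if any row's columns differ from the table schema."""
    for row in rows:
        mismatch = set(row) ^ set(table_schema)
        if mismatch:
            raise ValueError(f"Mismatched columns: {sorted(mismatch)}")


def coerce_to_schema(table_schema: Schema, rows: List[Row]) -> List[Row]:
    """Cast every value to the type declared in the table schema."""
    return [
        {name: typ(row[name]) for name, typ in table_schema.items()}
        for row in rows
    ]


def chunk_rows(rows: List[Row], target_rows: int) -> Iterator[List[Row]]:
    """Simplified stand-in for bin-packing: consecutive fixed-size chunks."""
    for start in range(0, len(rows), target_rows):
        yield rows[start:start + target_rows]


def write_file(rows: List[Row]) -> str:
    """Lowest layer: serialize only; assumes rows are already clean."""
    return repr(rows)  # placeholder for the actual Parquet write


def overwrite(table_schema: Schema, rows: List[Row], target_rows: int = 2) -> List[str]:
    """Top layer: validate and coerce once, then pass clean chunks down."""
    check_schema_compatible(table_schema, rows)
    clean = coerce_to_schema(table_schema, rows)
    return [write_file(chunk) for chunk in chunk_rows(clean, target_rows)]
```

With this layout, write_file never needs to know about the table schema, so a compatibility failure surfaces before any file is written.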

Jul 13 '24 19:07 kevinjqliu