iceberg-python
Support writing to a table with sort-order
Feature Request / Improvement
We currently fail when a table has a sort order. It would be great if we could sort and then write the data based on that sort order.
Hi @Fokko, I would like to take a shot at this if no one has already picked it up.
@vinjai Great, I've assigned it to you!
Hey @Fokko
A question about the transforms defined in SortOrder: our input for sorting is a pyarrow table in the append and overwrite methods. We have two options here:
- Sort the pyarrow table using pyarrow transforms. Many transforms are not yet implemented or supported in pyarrow, which leaves two sub-options when a transform is unsupported:
  - Breaking change: refuse to write and raise an appropriate error.
  - Silently proceed: write with the unsorted sort-order ID and skip sorting altogether.
- Convert the pyarrow table to Python objects and then sort within Python.
I am more in favor of the first one.
For instance (in BucketTransform): https://github.com/apache/iceberg-python/blob/0bf175d25de706a3aa094d81093faff4057295be/pyiceberg/transforms.py#L303-L304
@vinjai Since we ignore the write-order today, I think proceeding is fine. Maybe raise a warning so the user knows the data isn't being sorted. Sorting in Python would be very costly.
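For illustration, the "warn and proceed" behaviour could look roughly like the sketch below. This is not PyIceberg code: `SketchSortField`, `SUPPORTED_TRANSFORMS`, `plan_sort_keys` and `sort_or_warn` are hypothetical names, and the assumption that only identity transforms can be sorted on the Arrow side is mine. The only real API it relies on is `pyarrow.Table.sort_by`, which takes a list of `(column, order)` pairs.

```python
from __future__ import annotations

import warnings
from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # pyarrow is only needed for the type hints in this sketch
    import pyarrow as pa


@dataclass(frozen=True)
class SketchSortField:
    """Hypothetical stand-in for a sort field; NOT PyIceberg's SortField."""

    column: str
    transform: str = "identity"  # e.g. "identity" or "bucket[16]"
    descending: bool = False


# Assumption: only identity transforms can be applied directly on the Arrow side.
SUPPORTED_TRANSFORMS = {"identity"}


def plan_sort_keys(fields: list[SketchSortField]) -> Optional[list[tuple[str, str]]]:
    """Return sort keys for Table.sort_by, or None if any transform is unsupported."""
    unsupported = [f.transform for f in fields if f.transform not in SUPPORTED_TRANSFORMS]
    if unsupported:
        # Warn instead of failing, so the write can still go ahead unsorted.
        warnings.warn(
            f"Sort order uses unsupported transforms {unsupported}; writing data unsorted."
        )
        return None
    return [(f.column, "descending" if f.descending else "ascending") for f in fields]


def sort_or_warn(table: "pa.Table", fields: list[SketchSortField]) -> "pa.Table":
    """Sort the table when every sort field is supported; otherwise return it unchanged."""
    keys = plan_sort_keys(fields)
    if not keys:  # None (unsupported transform) or an empty sort order
        return table
    return table.sort_by(keys)
```

With a real `pyarrow.Table`, `sort_or_warn(table, [SketchSortField("id")])` would return the table sorted by `id`, while a field such as `SketchSortField("id", transform="bucket[16]")` would emit a warning and return the table unchanged. A real implementation would also need to record the unsorted sort-order ID for files written this way.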
Let's push this to the next release, when we have all the transforms implemented using the Rust extension. cc @sungwy