iceberg-python

Support writing to a table with sort-order

Open Fokko opened this issue 1 year ago • 9 comments

Feature Request / Improvement

Writes currently fail when the table has a sort order defined. It would be great if we could sort the data according to the sort order and then write it.

Fokko avatar Jan 16 '24 12:01 Fokko

Hi @Fokko, I would like to give this a shot if no one has already taken it.

vinjai avatar Jun 06 '24 03:06 vinjai

@vinjai Great, I've assigned it to you!

Fokko avatar Jun 07 '24 07:06 Fokko

Hey @Fokko

A question about the transforms defined in SortOrder: our input for sorting is a pyarrow table in the append and overwrite methods. We have two options:

  1. Sort the pyarrow table using the pyarrow implementation of each transform. Many transforms are not yet implemented or supported in pyarrow, which leaves two scenarios:
    • Breaking change: If a transform is not supported, we don't proceed with the write and raise an appropriate error instead.
    • Silently moving ahead: We record the unsorted order id and skip sorting altogether.
  2. Convert the pyarrow table to Python objects and then sort within Python.

I am more in favor of the first one.
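
The two scenarios under the first option can be sketched roughly as follows. This is only an illustration, not PyIceberg code: `SortField`, `ARROW_SORTABLE`, and `plan_write` are hypothetical names, and the sketch assumes only identity transforms can be sorted in Arrow. The one detail taken from Iceberg itself is that sort order id 0 denotes an unsorted order.

```python
import warnings
from dataclasses import dataclass

# Iceberg reserves sort order id 0 for "unsorted".
UNSORTED_ORDER_ID = 0

# Hypothetical: the transforms we assume an Arrow-based sort can handle.
ARROW_SORTABLE = {"identity"}


@dataclass(frozen=True)
class SortField:
    """Simplified, hypothetical stand-in for one field of a sort order."""
    column: str
    transform: str = "identity"
    descending: bool = False


def plan_write(order_id, fields, strict):
    """Return (order id to record, sort keys to apply) for a write.

    strict=True models the "breaking change" scenario and raises.
    strict=False models "moving ahead" without sorting; a warning is
    emitted here so that the fallback is at least visible.
    """
    unsupported = [f for f in fields if f.transform not in ARROW_SORTABLE]
    if not unsupported:
        keys = [
            (f.column, "descending" if f.descending else "ascending")
            for f in fields
        ]
        return order_id, keys

    names = ", ".join(f"{f.transform}({f.column})" for f in unsupported)
    if strict:
        raise NotImplementedError(f"Cannot sort by: {names}")
    warnings.warn(f"Writing unsorted data; cannot sort by: {names}")
    return UNSORTED_ORDER_ID, []
```

The returned keys use the `(column, direction)` pair format that pyarrow's sort functions accept, so the supported case could be handed to the sort step directly.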

vinjai avatar Jul 01 '24 09:07 vinjai

For instance (in BucketTransform): https://github.com/apache/iceberg-python/blob/0bf175d25de706a3aa094d81093faff4057295be/pyiceberg/transforms.py#L303-L304

vinjai avatar Jul 02 '24 05:07 vinjai

@vinjai Since we ignore the write-order today, I think proceeding is fine. Maybe raise a warning so the user knows the data isn't being sorted. Sorting in Python would be very costly.

Fokko avatar Jul 03 '24 05:07 Fokko

Let's push this to the next release, when we have all the transforms implemented using the Rust extension. cc @sungwy

Fokko avatar Oct 30 '24 19:10 Fokko