iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

feat: Partition Binding and safe PartitionSpecBuilder

Open c-thiel opened this issue 1 year ago • 5 comments

@Fokko , @liurenjie1024, @Xuanwo next up is PartitionBinding!

I went ahead with a proposal so that we have something concrete to discuss about. Let me explain a few decisions that I took:

  1. I tried to keep the API change as small as possible, however opted for a fallible version of with_partition_field. This gives early feedback and allows more fine-granular builds. If this fine-granular control isn't wanted by the user, there is a fields method that accepts multiple fields and yields a single error.
  2. Compatibility checks against the schema are performed only when calling build. This allows us to use the same builder including many of its checks also for UnboundPartitionSpec.
  3. I have more checks than Java I think. If I am not mistaken the Java builder trusts provided field_ids blindly (https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L559). We don't.
  4. I opted for a reduced builder state and instead an iteration when checking for partition_names. I believe this to be more readable.
  5. Re-Binding for bound PartitionSpecs is possible. However I skipped an explicit method to re-index fields. This is something needed in the TableMetadataBuilder I believe. re-indexing fields can however easily be achieved by .to_unbound() partition fields and then setting partition_id to None - all is part of the public API.
  6. I don't like that the PartitionSpec and UnboundPartitionSpec fields are public. Now that we have the builder, I would prefer to make them private to reduce errors by just using the constructor - similar to TableMetadata. I have left them public for now and waiting for your comments.

Any feedback welcome!

c-thiel avatar Jul 27 '24 16:07 c-thiel

I have a few ToDos in the code. Most of them are for tests in other modules that would fail with proper partition binding.

If someone has insights on the following, it would also be very helpful: Should we compare transform.deduped_name to find redundant partitionings or the transformation itself? If we switch to deduped_name, we would not support year & month & day partitions on the same column, as it is always "time".

deduped_name would probably be more correct for bucket transform, to disallow different bucket sizes on the same column. Java seems to use deduped_name: https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L420

Does java disalow year & month partitions on the same column? If so our tests are broken https://github.com/apache/iceberg-rust/blob/de450c80bbe5d7d7dcb919ded7015ca447d8edb1/crates/iceberg/src/expr/visitors/inclusive_projection.rs#L329-L356

c-thiel avatar Jul 27 '24 17:07 c-thiel

I have a few ToDos in the code. Most of them are for tests in other modules that would fail with proper partition binding.

If someone has insights on the following, it would also be very helpful: Should we compare transform.deduped_name to find redundant partitionings or the transformation itself? If we switch to deduped_name, we would not support year & month & day partitions on the same column, as it is always "time".

deduped_name would probably be more correct for bucket transform, to disallow different bucket sizes on the same column. Java seems to use deduped_name: https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L420

Does java disalow year & month partitions on the same column? If so our tests are broken

https://github.com/apache/iceberg-rust/blob/de450c80bbe5d7d7dcb919ded7015ca447d8edb1/crates/iceberg/src/expr/visitors/inclusive_projection.rs#L329-L356

I checked spark and it also fails: pyspark.errors.exceptions.captured.IllegalArgumentException: Cannot add redundant partition: 1000: ts_year: year(2) conflicts with 1001: ts_month: month(2)

I adapted the broken tests for predicates. @liurenjie1024 it would be good to have an extra vigilant eye on those. Most changed needed to be done due to incompatible schemas for certain partition specs.

c-thiel avatar Aug 06 '24 08:08 c-thiel

I checked spark and it also fails: pyspark.errors.exceptions.captured.IllegalArgumentException: Cannot add redundant partition: 1000: ts_year: year(2) conflicts with 1001: ts_month: month(2)

I think they don't allow, see these two methods: https://github.com/apache/iceberg/blob/afda8be25652d44d9339a79c6797b6bf20c55bd6/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L418 https://github.com/apache/iceberg/blob/c4f9337a61bf9f4e80f058e3f77b9cb8f617522e/api/src/main/java/org/apache/iceberg/transforms/Dates.java#L208

liurenjie1024 avatar Aug 07 '24 15:08 liurenjie1024

I adapted the broken tests for predicates. @liurenjie1024 it would be good to have an extra vigilant eye on those. Most changed needed to be done due to incompatible schemas for certain partition specs.

Also cc @Fokko to take a look.

liurenjie1024 avatar Aug 07 '24 15:08 liurenjie1024

@liurenjie1024 ready for another round!

c-thiel avatar Aug 12 '24 16:08 c-thiel