feat: Partition Binding and safe PartitionSpecBuilder
@Fokko , @liurenjie1024, @Xuanwo next up is PartitionBinding!
I went ahead with a proposal so that we have something concrete to discuss. Let me explain a few decisions I made:
- I tried to keep the API change as small as possible, but opted for a fallible version of `with_partition_field`. This gives early feedback and allows more fine-grained builds. If the user doesn't want this fine-grained control, there is a `fields` method that accepts multiple fields and yields a single error.
- Compatibility checks against the schema are performed only when calling `build`. This allows us to use the same builder, including many of its checks, for `UnboundPartitionSpec` as well.
- I believe we have more checks than Java. If I am not mistaken, the Java builder trusts provided field_ids blindly (https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L559). We don't.
- I opted for reduced builder state and instead iterate over the fields added so far when checking for duplicate partition names. I believe this to be more readable.
- Re-binding bound `PartitionSpec`s is possible. However, I skipped an explicit method to re-index fields, which I believe is needed in the `TableMetadataBuilder`. Re-indexing fields can easily be achieved by calling `.to_unbound()` on the partition fields and then setting `partition_id` to `None` - all of this is part of the public API.
- I don't like that the `PartitionSpec` and `UnboundPartitionSpec` fields are public. Now that we have the builder, I would prefer to make them private to reduce errors by forcing use of the builder - similar to `TableMetadata`. I have left them public for now and am waiting for your comments.
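The fallible-add / check-at-build flow described above can be sketched as follows. This is a minimal toy model, not the actual iceberg-rust API: the struct and method names (`PartitionSpecBuilder`, `add_partition_field`, `build`) and the error type are illustrative assumptions.

```rust
// Toy sketch of the fallible-builder pattern discussed in this PR.
// Names and signatures are illustrative, not the real crate API.

#[derive(Debug, PartialEq)]
struct PartitionField {
    source_name: String,
    name: String,
}

#[derive(Default)]
struct PartitionSpecBuilder {
    fields: Vec<PartitionField>,
}

impl PartitionSpecBuilder {
    // Fallible add: a duplicate partition name is rejected immediately,
    // giving early per-field feedback instead of one error at build time.
    fn add_partition_field(mut self, source: &str, name: &str) -> Result<Self, String> {
        // Reduced builder state: duplicates are detected by iterating
        // over the fields added so far, not by a separate name set.
        if self.fields.iter().any(|f| f.name == name) {
            return Err(format!("duplicate partition name: {name}"));
        }
        self.fields.push(PartitionField {
            source_name: source.to_string(),
            name: name.to_string(),
        });
        Ok(self)
    }

    // Schema compatibility checks would run here, at build time, so the
    // same builder (and most of its checks) can also serve UnboundPartitionSpec.
    fn build(self) -> Result<Vec<PartitionField>, String> {
        Ok(self.fields)
    }
}

fn main() {
    let spec = PartitionSpecBuilder::default()
        .add_partition_field("ts", "ts_day")
        .and_then(|b| b.add_partition_field("id", "id_bucket"))
        .and_then(|b| b.build())
        .unwrap();
    assert_eq!(spec.len(), 2);

    // A duplicate name fails early, at add time rather than at build time.
    let dup = PartitionSpecBuilder::default()
        .add_partition_field("ts", "ts_day")
        .and_then(|b| b.add_partition_field("ts", "ts_day"));
    assert!(dup.is_err());
}
```

Chaining the fallible adds with `and_then` also shows how a `fields`-style bulk method could fold many fields into a single `Result`.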
Any feedback welcome!
I have a few ToDos in the code. Most of them are for tests in other modules that would fail with proper partition binding.
If someone has insights on the following, it would also be very helpful: Should we compare `transform.deduped_name` to find redundant partitionings, or the transformation itself? If we switch to `deduped_name`, we would not support year & month & day partitions on the same column, as the deduped name is always "time".
`deduped_name` would probably be more correct for the bucket transform, to disallow different bucket sizes on the same column. Java seems to use `deduped_name`: https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L420
Does Java disallow year & month partitions on the same column? If so, our tests are broken: https://github.com/apache/iceberg-rust/blob/de450c80bbe5d7d7dcb919ded7015ca447d8edb1/crates/iceberg/src/expr/visitors/inclusive_projection.rs#L329-L356
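To make the trade-off concrete, here is a toy model of the two comparison strategies. Everything below is illustrative: the `Transform` enum, `dedup_name`, and `is_redundant` are assumptions modeling the idea in the question (time transforms collapsing to "time", bucket widths collapsing to "bucket"), not the actual crate or Java implementation.

```rust
// Toy model: redundancy detection via dedup_name vs. exact transform equality.

#[derive(Debug, PartialEq)]
enum Transform {
    Year,
    Month,
    Day,
    Bucket(u32),
}

impl Transform {
    // Assumed dedup key: all time-based transforms collapse to "time",
    // so year & month on one column would count as redundant; bucket
    // widths collapse too, disallowing different sizes on one column.
    fn dedup_name(&self) -> &'static str {
        match self {
            Transform::Year | Transform::Month | Transform::Day => "time",
            Transform::Bucket(_) => "bucket",
        }
    }
}

// Redundancy check keyed on (source column, dedup name).
fn is_redundant(existing: &[(&str, Transform)], col: &str, t: &Transform) -> bool {
    existing
        .iter()
        .any(|(c, e)| *c == col && e.dedup_name() == t.dedup_name())
}

fn main() {
    let existing = vec![("ts", Transform::Year)];
    // With dedup_name, month on the same column conflicts with year...
    assert!(is_redundant(&existing, "ts", &Transform::Month));
    // ...while comparing the transforms themselves would not flag it.
    assert!(existing.iter().all(|(c, e)| !(*c == "ts" && *e == Transform::Month)));

    // dedup_name also catches different bucket widths on one column.
    let buckets = vec![("id", Transform::Bucket(16))];
    assert!(is_redundant(&buckets, "id", &Transform::Bucket(32)));
}
```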
I checked Spark and it also fails: `pyspark.errors.exceptions.captured.IllegalArgumentException: Cannot add redundant partition: 1000: ts_year: year(2) conflicts with 1001: ts_month: month(2)`
I adapted the broken tests for predicates. @liurenjie1024 it would be good to have an extra vigilant eye on those. Most changes were needed due to incompatible schemas for certain partition specs.
I think they don't allow it; see these two methods: https://github.com/apache/iceberg/blob/afda8be25652d44d9339a79c6797b6bf20c55bd6/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L418 https://github.com/apache/iceberg/blob/c4f9337a61bf9f4e80f058e3f77b9cb8f617522e/api/src/main/java/org/apache/iceberg/transforms/Dates.java#L208
Also cc @Fokko to take a look.
@liurenjie1024 ready for another round!