partitioned write support

Open jqin61 opened this issue 1 year ago • 1 comments

Todo

[x] support partitioned append()
- [x] support append with identity transform
- [x] fix scenario when arrow table schema not aligned with iceberg schema (finished by others)
- [x] add integration test for null column partitioning after issue#348 is closed
- [x] avoid sorting input arrow table when it is already sorted
[x] support partition field in manifest file (see PartitionKey)
[x] apply transform for partitioning; algorithm efficiency analysis when a transform is involved
[x] support partitioned static overwrite()
- [x] overwrite entire table
- [x] overwrite with expression or filter string (specified partition)
- [x] overwrite filter validation (https://github.com/jqin61/iceberg-python/pull/4). As discussed in the monthly meeting, overwrite will be supported by delete + append. We will support a wider range of filters than Spark Iceberg and might rewrite files for overwriting rather than just using IsNull and EqualTo, so this is not needed.
[x] extend summary for partitioned stats (https://github.com/apache/iceberg-python/pull/521)
[x] support partitioned dynamic overwrite()
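
For readers new to the identity-transform case, here is a minimal sketch of the slicing idea: rows are grouped by the raw values of their partition fields, and a null value is just another partition key. This is plain Python over dicts, not the PyIceberg implementation; `slice_by_identity_partition` and the sample rows are hypothetical names made up for illustration.

```python
from collections import defaultdict


def slice_by_identity_partition(rows, partition_fields):
    """Group rows by the tuple of their partition-field values.

    With the identity transform the partition value is the column value
    itself, so None (a null column) simply becomes its own key.
    Hypothetical helper, for illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in partition_fields)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"city": "Paris", "n": 1},
    {"city": None, "n": 2},
    {"city": "Paris", "n": 3},
]
slices = slice_by_identity_partition(rows, ["city"])
```

Each resulting slice would correspond to one data file (or set of files) written under that partition. A single grouping pass like this also does not require the input to be sorted first.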

Feb 02 '24 05:02 jqin61

As discussed in the monthly community sync, this will be broken down into 4 PRs:

- Partitioned append with identity transform
- Dynamic overwrite using delete + append, 2 snapshots in one commit
- Hidden partitioning support (for slicing the arrow table, manifest file entry.partition, data file path)
- Static overwrite using delete + append, 2 snapshots in one commit
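
To make the "delete + append" wording concrete, below is a rough sketch of dynamic overwrite on an in-memory list of rows. It assumes identity partitioning and models each of the two snapshots as a plain list; `partition_key` and `dynamic_overwrite` are hypothetical helpers, not PyIceberg functions.

```python
def partition_key(row, fields):
    """Identity-transform partition key: the raw values of the partition fields."""
    return tuple(row[f] for f in fields)


def dynamic_overwrite(existing_rows, new_rows, fields):
    """Return the table state after each of the two snapshots.

    Only partitions that appear in new_rows are replaced; all other
    partitions are left untouched. Hypothetical sketch, for illustration only.
    """
    touched = {partition_key(r, fields) for r in new_rows}
    # Snapshot 1 (delete): drop existing rows that live in a touched partition.
    after_delete = [r for r in existing_rows if partition_key(r, fields) not in touched]
    # Snapshot 2 (append): add the incoming rows.
    after_append = after_delete + list(new_rows)
    return after_delete, after_append


existing = [
    {"day": "2024-01-01", "v": 1},
    {"day": "2024-01-02", "v": 2},
]
incoming = [{"day": "2024-01-02", "v": 9}]
after_delete, after_append = dynamic_overwrite(existing, incoming, ["day"])
```

Static overwrite would follow the same two-step shape, except that the delete step is driven by a user-supplied filter rather than by the partitions found in the incoming data.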

Mar 28 '24 14:03 jqin61