iceberg-python

partitioned write support

Open jqin61 opened this issue 1 year ago • 1 comment

Todo

  • [x] support partitioned append()
    • [x] support append with identity transform
    • [x] fix scenario when arrow table schema not aligned with iceberg schema (finished by others)
    • [x] add integration test for null column partitioning after issue#348 is closed
    • [x] avoid sorting input arrow table when it is already sorted
  • [x] support partition field in manifest file (see PartitionKey)
  • [x] apply transform for partitioning; analyze algorithm efficiency when a transform is involved
  • [x] support partitioned static overwrite()
    • [x] overwrite entire table
    • [x] overwrite with expression or filter string (specified partition)
    • [x] overwrite filter validation (https://github.com/jqin61/iceberg-python/pull/4). As discussed in the monthly meeting, overwrite will be supported by delete + append. We will support a wider range of filters than Spark Iceberg and might rewrite files when overwriting rather than only using IsNull and EqualTo, so this is not needed.
  • [x] extend summary for partitioned stats (https://github.com/apache/iceberg-python/pull/521)
  • [x] support partitioned dynamic overwrite()
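
For readers new to the idea, here is a minimal sketch of what slicing input rows by an identity transform means. It is plain Python over a list of dicts, not the pyiceberg implementation (which operates on Arrow tables); the function name `split_by_identity_partition` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_identity_partition(
    rows: List[Row], partition_cols: List[str]
) -> Dict[Tuple[Any, ...], List[Row]]:
    """Group rows by the raw values of the partition columns.

    With an identity transform the partition value is the column value
    itself, so the grouping key is just the tuple of those values.
    """
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_cols)
        groups[key].append(row)
    return dict(groups)


# Hypothetical sample data; None stands in for a null partition value.
rows = [
    {"city": "Paris", "n": 1},
    {"city": "Oslo", "n": 2},
    {"city": "Paris", "n": 3},
    {"city": None, "n": 4},
]
parts = split_by_identity_partition(rows, ["city"])
```

Each group would then be written as its own data file(s), with the key recorded as the partition value in the manifest entry. Null values form their own group here, which is the case the null-column integration test above is concerned with.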

jqin61 avatar Feb 02 '24 05:02 jqin61

As discussed in the monthly community sync, this will be broken down into 4 PRs:

  1. Partitioned append with identity transform
  2. Dynamic overwrite using delete + append, 2 snapshots in one commit
  3. Hidden partitioning support (for slicing the arrow table, manifest file entry.partition, data file path)
  4. Static overwrite using delete + append, 2 snapshots in one commit
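
To illustrate the "delete + append" idea behind dynamic overwrite in item 2, here is a rough sketch in plain Python. It models a table as a list of dicts and does not model snapshots, manifests, or commits; `dynamic_overwrite` and the sample data are hypothetical names, not pyiceberg API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def dynamic_overwrite(table: List[Row], new_rows: List[Row], partition_col: str) -> List[Row]:
    """Replace only the partitions that appear in new_rows.

    Step 1 (delete): drop existing rows whose partition value is touched
    by the incoming data.
    Step 2 (append): add the incoming rows.
    Partitions not present in new_rows are left untouched.
    """
    touched = {row[partition_col] for row in new_rows}
    kept = [row for row in table if row[partition_col] not in touched]
    return kept + new_rows


# Hypothetical sample data, partitioned by "day".
existing = [
    {"day": "2024-03-01", "v": 1},
    {"day": "2024-03-02", "v": 2},
]
incoming = [{"day": "2024-03-02", "v": 20}]
result = dynamic_overwrite(existing, incoming, "day")
```

In the real table format, each of the two steps would produce its own snapshot, which is what "2 snapshots in one commit" refers to. A static overwrite differs in that the partitions to delete come from a user-supplied filter rather than from the incoming data.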

jqin61 avatar Mar 28 '24 14:03 jqin61