Support partition evolution (old files having different partitioning schemes vs new files)

Open jaychia opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.

Currently, Daft assumes that all files retrieved from a given Iceberg table have the same partitioning:

  1. Retrieve current partition spec from the table
  2. Translate any predicates into partition filters (e.g. dt > 1970-02-01 becomes day(dt) > 30)
  3. Apply this partition filter naively to any ScanTasks

However, in certain cases, the partitioning of old data might differ from the current partitioning spec because of "partition evolution". For example, if the partitioning used to be month(dt), then the predicate from before should be translated to day(dt) > 30 for new files, but to month(dt) > 1 for old files.
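
To make the difference concrete, here is a minimal, self-contained sketch of naive versus spec-aware pruning for the predicate dt > 1970-02-01. It does not use Daft's or PyIceberg's actual APIs: `DataFile`, `naive_prune`, `spec_aware_prune`, the `specs` mapping, and the file names are all hypothetical names made up for illustration. It assumes Iceberg-style transforms (days and months counted from 1970-01-01, starting at zero) and keeps a file whenever its partition value is at or above the transformed bound, which is a conservative choice, so the exact thresholds may differ from the shorthand used above.

```python
from dataclasses import dataclass
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(d: date) -> int:
    # Days since 1970-01-01 (zero-based).
    return (d - EPOCH).days


def month_transform(d: date) -> int:
    # Months since 1970-01 (zero-based).
    return (d.year - 1970) * 12 + (d.month - 1)


TRANSFORMS = {"day": day_transform, "month": month_transform}


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a scan task: each file remembers the
    # partition spec it was written with and its partition value.
    path: str
    spec_id: int
    partition_value: int


def naive_prune(files, specs, current_spec_id, bound):
    # Current behaviour as described in the issue: translate the predicate
    # once, using the table's current spec, and apply it to every file.
    threshold = TRANSFORMS[specs[current_spec_id]](bound)
    return [f for f in files if f.partition_value >= threshold]


def spec_aware_prune(files, specs, bound):
    # Proposed behaviour: translate the predicate per file, using the spec
    # that the file was actually written with.
    kept = []
    for f in files:
        threshold = TRANSFORMS[specs[f.spec_id]](bound)
        if f.partition_value >= threshold:
            kept.append(f)
    return kept


# Spec 0 is the old month(dt) partitioning, spec 1 the current day(dt) one.
specs = {0: "month", 1: "day"}
files = [
    DataFile("old-1970-01.parquet", spec_id=0, partition_value=0),
    DataFile("old-1970-03.parquet", spec_id=0, partition_value=2),
    DataFile("new-1970-01-15.parquet", spec_id=1, partition_value=14),
    DataFile("new-1970-02-10.parquet", spec_id=1, partition_value=40),
]
bound = date(1970, 2, 1)

naive_paths = [f.path for f in naive_prune(files, specs, 1, bound)]
aware_paths = [f.path for f in spec_aware_prune(files, specs, bound)]
print("naive:     ", naive_paths)
print("spec-aware:", aware_paths)
```

In this toy setup the naive version compares the month-partitioned file for March 1970 (partition value 2) against a day-based threshold and drops it, even though it contains matching rows. The spec-aware version keeps it because it resolves the transform per file.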

See: #2084 for tests

jaychia avatar May 07 '24 21:05 jaychia

@jaychia can you merge in the tests behind a pytest skip? I'll take a look after that!

samster25 avatar May 13 '24 18:05 samster25

> @jaychia can you merge in the tests behind a pytest skip? I'll take a look after that!

Sounds good, pending merge: https://github.com/Eventual-Inc/Daft/pull/2084

jaychia avatar May 13 '24 19:05 jaychia