iceberg-python

Support get partition table with filter

Open Fokko opened this issue 2 years ago • 7 comments

Feature Request / Improvement

Migration of issue https://github.com/apache/iceberg/issues/8619

Fokko avatar Oct 02 '23 10:10 Fokko

Hello @Fokko, here is my use case:

  1. Given a table, find out all the partitions it has.
  2. Given a partition filter, check whether such a partition exists in the table.

Thank you!

puchengy avatar Oct 02 '23 16:10 puchengy

@puchengy The problem with Iceberg is that the partition is more of a logical concept, rather than a physical path like in a Hive table. What do you think of passing in a predicate, and letting the Airflow sensor pass if there are rows?

For example, you could go from a daily to an hourly partition. Then you would get:

2023-01-01T00:00:00
2023-01-02T00:00:00
2023-01-03T00:00:00
2023-01-03T23:00:00 # Changed from daily to hourly
2023-01-04T00:00:00
2023-01-04T01:00:00
2023-01-04T02:00:00
2023-01-04T03:00:00
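
To make the daily-to-hourly example concrete, here is a minimal sketch in plain Python (not PyIceberg code). It assumes the table is partitioned by a timestamp column using Iceberg's `day` transform (days since the Unix epoch) and later evolves to the `hour` transform (hours since the Unix epoch). The function names are hypothetical and only illustrate why a "partition" is a derived value rather than a fixed path.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Days since the Unix epoch (the value a `day` partition would hold)."""
    return (ts - EPOCH).days


def hour_transform(ts: datetime) -> int:
    """Hours since the Unix epoch (the value an `hour` partition would hold)."""
    delta = ts - EPOCH
    return delta.days * 24 + delta.seconds // 3600


# Rows written before the spec change are grouped by day ...
before = datetime(2023, 1, 3, 0, 0, tzinfo=timezone.utc)
# ... rows written after it are grouped by hour, in the same table.
after = datetime(2023, 1, 3, 23, 0, tzinfo=timezone.utc)

print("daily partition value:", day_transform(before))
print("hourly partition value:", hour_transform(after))
```

Both rows fall on the same calendar day, yet they carry different kinds of partition values, so a check such as "does partition X exist" depends on which spec wrote the data. A predicate on the timestamp column works across both.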

Fokko avatar Oct 02 '23 17:10 Fokko

> What do you think of passing in a predicate, and letting the Airflow sensor pass if there are rows?

@Fokko That works. This is actually what we are doing (but for legacy_python) https://github.com/pinterest/iceberg/commit/7d8d65d7ae8ed559052444928f00c36a11fe8f7d

Would this be something we can implement in the upstream? Thanks
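
For reference, a minimal sketch of what such a predicate check could look like on the PyIceberg side. It assumes the table object exposes `scan(row_filter=...)` and that the resulting scan has `plan_files()`, as in current PyIceberg. The helper name `partition_has_data` and the catalog and table names in the usage comment are hypothetical. This is not the code from the linked commit.

```python
from typing import Any


def partition_has_data(table: Any, row_filter: str) -> bool:
    """Return True if at least one data file may contain rows matching the filter.

    `table` is expected to behave like a PyIceberg table (assumption):
    `table.scan(row_filter=...)` returns a scan whose `plan_files()` yields
    one task per candidate data file.
    """
    scan = table.scan(row_filter=row_filter)
    # Stop at the first planned file instead of materializing the whole plan.
    return next(iter(scan.plan_files()), None) is not None


# Hypothetical usage inside a sensor's poke method:
#
#   from pyiceberg.catalog import load_catalog
#   table = load_catalog("default").load_table("db.events")
#   ready = partition_has_data(table, "ts >= '2023-01-03T00:00:00'")
```

Note that file planning prunes by partition values and column statistics, so a planned file means rows may match rather than that they definitely do. A stricter check would read at least one row from the scan.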

puchengy avatar Oct 02 '23 17:10 puchengy

@Fokko gentle ping, thanks ^

puchengy avatar Oct 05 '23 18:10 puchengy

@puchengy Yes, certainly. Would this be something that you're interested in working on? From the snapshot, we can load the manifest list, and from there the manifests themselves, which contain the partition information.
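
A rough sketch of that traversal, to show the shape of it. It assumes PyIceberg's `Table.current_snapshot()`, `Snapshot.manifests(io)` and `ManifestFile.fetch_manifest_entry(io)`; these names may differ between versions. The helper `iter_partitions` is hypothetical, and it does no filtering yet.

```python
from typing import Any, Iterator


def iter_partitions(table: Any) -> Iterator[Any]:
    """Yield each distinct partition value found in the current snapshot.

    Walks snapshot -> manifest list -> manifests -> entries and reads the
    partition record stored on every data file.
    """
    snapshot = table.current_snapshot()
    if snapshot is None:
        # An empty table has no snapshot and therefore no partitions.
        return
    seen = set()
    for manifest in snapshot.manifests(table.io):
        for entry in manifest.fetch_manifest_entry(table.io):
            partition = entry.data_file.partition
            # Use the repr as a dedup key so the record need not be hashable.
            key = repr(partition)
            if key not in seen:
                seen.add(key)
                yield partition
```

A filtered variant could evaluate the predicate against each manifest's partition summaries before opening the manifest, which avoids reading manifests that cannot match.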

Fokko avatar Oct 05 '23 18:10 Fokko

@Fokko Yes, I can help. Thanks.

puchengy avatar Oct 06 '23 02:10 puchengy

I was looking for something comparable to Spark's partitions metadata table, which lets me do something like this:

SELECT partition, last_updated_snapshot_id, last_updated_at
FROM prod.db.table.partitions
WHERE partition.foo='bar'

to determine if and when a partition was updated, and came across this issue. It sounds like this could be covered by this Feature Request if it includes a ManifestReader with filtering features like the linked code in legacy_python. Is that correct? If not, I will raise it as a separate issue.

pp-akursar avatar Feb 29 '24 22:02 pp-akursar

Partitions table was added in: https://github.com/apache/iceberg-python/pull/603
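
A hedged sketch of how the Spark query above might translate. It assumes the metadata table is reachable as `table.inspect.partitions()` and returns a PyArrow table with a struct column named `partition`, alongside columns such as `last_updated_at` and `last_updated_snapshot_id`. Check the column names against your PyIceberg version. The helper `find_partitions` is hypothetical.

```python
from typing import Any, Dict, List


def find_partitions(table: Any, **partition_values: Any) -> List[Dict[str, Any]]:
    """Return partitions-table rows whose partition fields equal the given values."""
    rows = table.inspect.partitions().to_pylist()
    return [
        row
        for row in rows
        # Each row's "partition" is a dict of partition field name -> value.
        if all(
            (row.get("partition") or {}).get(name) == value
            for name, value in partition_values.items()
        )
    ]


# Hypothetical usage, mirroring WHERE partition.foo = 'bar':
#
#   for row in find_partitions(table, foo="bar"):
#       print(row["last_updated_snapshot_id"], row["last_updated_at"])
```

An empty result means no partition matched, which also covers the original "does this partition exist" use case.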

sungwy avatar Jun 15 '24 22:06 sungwy