iceberg-python
PyIceberg Near-Term Roadmap
Feature Request / Improvement
PyIceberg 0.7.0
The main objective of 0.7.0 is to have partitioned writes (non-exhaustive list :)
- [x] Support for merge-into / upsert: https://github.com/apache/iceberg-python/issues/402
- [x] Support partitioned appends: https://github.com/apache/iceberg-python/pull/784
- [x] Support partial deletes: https://github.com/apache/iceberg-python/pull/569
- [x] Support for parallelizing writes: https://github.com/apache/iceberg-python/issues/428 https://github.com/apache/iceberg-python/issues/346
- [x] Support parallelized writes: https://github.com/apache/iceberg-python/pull/444
- [x] Support table_exists on catalog: https://github.com/apache/iceberg-python/issues/406 https://github.com/apache/iceberg-python/issues/507, fixed in https://github.com/apache/iceberg-python/pull/512
- [x] Metadata tables: https://github.com/apache/iceberg-python/issues/511
  - [x] Files assigned to @Gowthami03B, PR in https://github.com/apache/iceberg-python/pull/614
  - [x] Snapshots assigned to @Fokko in https://github.com/apache/iceberg-python/pull/524/
  - [x] History assigned to @ndrluis in https://github.com/apache/iceberg-python/pull/828
  - [x] Metadata log entries @kevinjqliu (issue in https://github.com/apache/iceberg-python/issues/594): https://github.com/apache/iceberg-python/pull/667
  - [x] Manifests @geruh: PR in https://github.com/apache/iceberg-python/pull/717
  - [x] Partitions assigned to @syun64 (issue in https://github.com/apache/iceberg-python/issues/24): https://github.com/apache/iceberg-python/pull/603
  - [x] References assigned to @geruh in https://github.com/apache/iceberg-python/pull/602
  - [x] Entries assigned to @Fokko in https://github.com/apache/iceberg-python/pull/551
- [ ] Manifest read/write improvements:
  - [ ] Implement rolling writes: https://github.com/apache/iceberg-python/issues/596 https://github.com/apache/iceberg-python/pull/650
  - [ ] Caching of manifests: https://github.com/apache/iceberg-python/issues/595
- [ ] Incremental append scan: https://github.com/apache/iceberg-python/pull/533
PyIceberg 0.8.0
- [ ] Table maintenance:
  - [ ] Snapshot expiration
  - [ ] Metadata rewrites
  - [ ] Compaction
  - [ ] Delete orphan files
- [ ] Catalogs:
  - [ ] Snowflake catalog: https://github.com/apache/iceberg-python/pull/687
  - [x] Nessie catalog: https://github.com/apache/iceberg-python/issues/19
  - [ ] BigLake catalog: https://github.com/apache/iceberg-python/issues/651
- [ ] ORC Support: https://github.com/apache/iceberg-python/issues/20
- [ ] Branch Support: https://github.com/apache/iceberg-python/issues/306
- [x] Tag Support: https://github.com/apache/iceberg-python/issues/573. PR: https://github.com/apache/iceberg-python/pull/603
- [ ] Write with Sort Order https://github.com/apache/iceberg-python/issues/271
- [ ] Support deletes with Merge-on-read: https://github.com/apache/iceberg-python/issues/1078
- [ ] Support writes to Bucket Partitioned Tables: https://github.com/apache/iceberg-python/issues/1074
PyIceberg 1.0.0
Long-term goals:
- Support Griffe to detect breaking API changes https://github.com/apache/iceberg-python/issues/334: https://github.com/apache/iceberg-python/pull/394
- Implement Arrow dataset: https://github.com/apache/iceberg-python/issues/30
- Support table maintenance operations: https://github.com/apache/iceberg-python/issues/31
- Add View support
- Add Puffin support
- Support engine integrations
  - [ ] DuckDB
  - [x] Daft (https://github.com/Eventual-Inc/Daft/issues/1877)
  - [ ] Polars (https://github.com/pola-rs/polars/pull/15018)
  - [ ] Ray
- [ ] Support Commit Retries: https://github.com/apache/iceberg-python/issues/269
@kevinjqliu @Fokko Where would something like the Iceberg Spark create_changelog_view procedure fit in this roadmap? Is that something that might be tackled as part of the other procedures under table maintenance, or is it likely to come later (1.0.0), or not at all in PyIceberg?
Sorry for the late reply, I was touching grass.
Thanks for bringing this up @corleyma 🙌 Some related work is being done in https://github.com/apache/iceberg-python/pull/533/ and I think PyIceberg should definitely support something like that.
@kevinjqliu @Fokko where would something like https://github.com/apache/iceberg-python/issues/402 go?
I've added it to the overview. Once the partial deletes + partitioned writes are in, this is supported automatically. We might want to have some community discussion on the API once those two PRs land.
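To illustrate why an upsert follows from those two pieces, here is a minimal sketch that uses plain Python lists in place of table data. The helper names `delete_where`, `append`, and `upsert` are hypothetical and are not PyIceberg API; the sketch only shows the idea of deleting rows whose key matches the incoming data and then appending the incoming rows.

```python
from typing import Any

Row = dict[str, Any]


def delete_where(rows: list[Row], key: str, keys_to_delete: set[Any]) -> list[Row]:
    """Hypothetical stand-in for a partial delete: drop rows whose key matches."""
    return [row for row in rows if row[key] not in keys_to_delete]


def append(rows: list[Row], new_rows: list[Row]) -> list[Row]:
    """Hypothetical stand-in for an append: add the new rows at the end."""
    return rows + new_rows


def upsert(rows: list[Row], new_rows: list[Row], key: str) -> list[Row]:
    """Upsert expressed as: delete the rows being replaced, then append."""
    incoming_keys = {row[key] for row in new_rows}
    return append(delete_where(rows, key, incoming_keys), new_rows)


table = [{"id": 1, "city": "Amsterdam"}, {"id": 2, "city": "Paris"}]
changes = [{"id": 2, "city": "Berlin"}, {"id": 3, "city": "Oslo"}]
result = upsert(table, changes, key="id")
```

In this toy version the row with `id` 2 is replaced and the row with `id` 3 is added. How the real API should look is left to the community discussion mentioned above.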
@Fokko can we add issues for creating tests and documentation for the new features of 0.7.0 as good first issues?
@tusharchou: Whenever you create a new feature, you need to add unit and integration tests and make the necessary changes in mkdocs as part of that PR. If you feel that some parts are missing, please feel free to raise an improvement/issue and we can discuss it in the Python sync-up.
It looks like BigLake metastore is going to be replaced with BigQuery metastore. Is the version 0.8.0 roadmap still up-to-date?
https://github.com/trinodb/trino/issues/20031#issuecomment-2310391785
@jaehyeon-kim That is correct. BigQuery Metastore is the replacement for BigLake Metastore. I recommend adjusting the roadmap to skip BigLake metastore and add support for BigQuery Metastore. This PR to the Iceberg Java libraries should be a good reference.
Thanks for the context @anoopj. @jaehyeon-kim it looks like #651 is a feature request. There's currently no committed date to implement it, so I'll readjust the roadmap to reflect that.