Feat/iceberg advanced partitioning

Open rakesh-tmdc opened this issue 3 months ago • 5 comments

Hey Team,

I’ve been using dlt for the past 3–4 months, mostly with Apache Iceberg as the destination. Recently, I needed support for Iceberg partitioning, especially for more advanced use cases like time and bucket partitions.

I’ve implemented support for these in a way that’s fully compatible with existing column-level partition configurations.

It still works with earlier formats like:

{ "region": { "partition": true }, "category": { "partition": true } }

It now also supports advanced options like:

{
  "date_added": { "partition": { "type": "year", "index": 1, "name": "yearly_partition" } },
  "user_id": { "partition": { "type": "bucket", "index": 2, "bucket_count": 32, "name": "user_bucket" } },
  "region": { "partition": { "type": "identity", "index": 3 } }
}
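For readers following along, here is a minimal sketch of how the two hint shapes (the boolean form and the dictionary form) could be normalized into one ordered list of partition fields. It is not the code from this PR and not a dlt API: `PartitionField` and `normalize_partition_hints` are hypothetical names, and the ordering rule (explicit `index` first, then declaration order) and the `bucket_count` check are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


# Hypothetical container for illustration only; not a dlt or pyiceberg type.
@dataclass(frozen=True)
class PartitionField:
    column: str
    transform: str = "identity"
    index: Optional[int] = None
    name: Optional[str] = None
    bucket_count: Optional[int] = None


def normalize_partition_hints(columns: Dict[str, Dict[str, Any]]) -> List[PartitionField]:
    """Turn column hints into an ordered list of partition fields.

    Assumed rules (not confirmed by the PR):
    - `"partition": true` means an identity partition on that column;
    - a dict hint defaults to the identity transform when "type" is missing;
    - fields with an explicit "index" come first, sorted ascending, followed
      by unindexed fields in declaration order.
    """
    fields: List[PartitionField] = []
    for column, hints in columns.items():
        hint = hints.get("partition")
        if hint is True:
            # Legacy boolean form.
            fields.append(PartitionField(column=column))
        elif isinstance(hint, dict):
            transform = hint.get("type", "identity")
            bucket_count = hint.get("bucket_count")
            if transform == "bucket" and not bucket_count:
                raise ValueError(f"bucket partition on {column!r} needs a bucket_count")
            fields.append(
                PartitionField(
                    column=column,
                    transform=transform,
                    index=hint.get("index"),
                    name=hint.get("name"),
                    bucket_count=bucket_count,
                )
            )
        # Any other value (False, None, missing) leaves the column unpartitioned.

    # list.sort is stable, so unindexed fields keep their declaration order.
    fields.sort(key=lambda f: (f.index is None, f.index or 0))
    return fields


legacy = {"region": {"partition": True}, "category": {"partition": True}}
advanced = {
    "region": {"partition": {"type": "identity", "index": 3}},
    "date_added": {"partition": {"type": "year", "index": 1, "name": "yearly_partition"}},
    "user_id": {"partition": {"type": "bucket", "index": 2, "bucket_count": 32, "name": "user_bucket"}},
}

legacy_fields = normalize_partition_hints(legacy)
advanced_fields = normalize_partition_hints(advanced)
print([f.column for f in legacy_fields])
print([(f.column, f.transform) for f in advanced_fields])
```

Treating the boolean form as shorthand for an identity partition is what keeps older schemas working unchanged in this sketch.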

Would love feedback from the team!

rakesh-tmdc avatar Sep 02 '25 15:09 rakesh-tmdc

Hi @rakesh-tmdc, thanks for the contribution, this looks good and useful. In dlt+ we already have an iceberg_adapter() with iceberg_partition helpers for these transforms. We're open to moving this adapter module to open source dlt so your PR can reuse it and stay fully compatible with our existing semantics/docs.

If you're up for it, we can extract the adapter and have your changes delegate partition spec parsing/validation to it to keep behavior consistent across catalogs.

burnash avatar Sep 14 '25 19:09 burnash

Thanks @burnash, glad to hear this is useful! Extracting the iceberg_adapter and its partition helpers into open source sounds like a great idea. I’d definitely prefer to reuse that instead of duplicating logic.

Once it’s available, I can rework my PR so that partition spec parsing/validation delegates to the adapter, which should keep things consistent. Just let me know when/where the adapter lands, and I’ll update accordingly.

rakesh-tmdc avatar Sep 15 '25 07:09 rakesh-tmdc

Hi @rakesh-tmdc,

I've ported the iceberg_adapter() and iceberg_partition helpers from dlt+ to dlt core. These are now available in dlt/destinations/impl/filesystem/iceberg_adapter.py and provide the canonical way in dlt to configure Iceberg partitioning going forward.

Now that we have the adapter in place, here's what needs to happen next to complete this PR:

  1. Remove the old implementation from dlt/common/libs/pyiceberg.py. The following code should be deleted, as it's now superseded by the iceberg_adapter:
     - PartitionType enum and old PartitionSpec dataclass
     - IcebergPartitionManager
     - _validate_partition_spec(), _validate_and_fix_indices(), extract_partition_specs_from_schema() functions
  2. Reset the partition hint type in dlt/common/schema/typing.py back to Optional[bool].
  3. Update the tests in tests/common/libs/test_pyiceberg.py so they no longer test the removed functions and instead cover the new adapter-based approach. You can see an example of iceberg_adapter() and iceberg_partition usage in the iceberg_adapter() docstring.

You can also test backward compatibility with the older way of defining identity partitions by running:

TESTS__BUCKET_URL_FILE="_storage/data" pytest tests/load/pipeline/test_open_table_pipeline.py::test_table_format_partitioning -k "iceberg"

where _storage/data is a path to a local folder for the Iceberg files.

Let me know if you have any questions!

burnash avatar Oct 30 '25 15:10 burnash

Hi @rakesh-tmdc, I wanted to check in and see if you're still interested in continuing with this PR. If you'd like to move forward, great. If you're currently busy or have moved on to other priorities, that's completely understandable, and I can take over from here.

Either way, thanks for the contribution and for kicking off this feature.

burnash avatar Nov 10 '25 10:11 burnash

Hi @rakesh-tmdc,

Just checking in again. We'd really like to get this PR merged and we now have the iceberg_adapter API in place. Since this PR has been quiet for a bit, we’ll go ahead and remove the old implementation in pyiceberg.py and adapt the tests to the new adapter-based API. If you'd still like to finish it yourself, just shout and we can coordinate. Thanks again for starting this work!

burnash avatar Nov 22 '25 14:11 burnash

Hi @burnash — sorry for the delayed response! I missed some of the recent updates. If you’d still like me to continue working on this PR, I’m happy to pick it back up and follow through with the remaining changes.

Thanks again for all the work you’ve put into the adapter and tests!

rakesh-tmdc avatar Nov 29 '25 18:11 rakesh-tmdc