kedro-datasets: Add support for Iceberg Tables
Description
Follow up from documenting Kedro + Iceberg in https://github.com/kedro-org/kedro/pull/4521
Add native support for Iceberg tables to kedro-datasets
Context
We don't have any datasets that support Iceberg tables. The documentation added in https://github.com/kedro-org/kedro/pull/4521 is fairly minimal, and the approach it describes has limitations:
- Only works for `pandas` dataframes
- Works with `pyiceberg` behind the scenes, which doesn't support the full range of features you can leverage for Iceberg tables
- Is a custom implementation
I also want to get more feedback from the community about what level of support and which features they would expect from this dataset (or these datasets). I would also like to hear from users about how they currently use Iceberg tables with Kedro.
Possible Implementation
- Extend the custom example from docs
- Use other libraries in the backend
- Spark + Iceberg dataset
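To make the "other libraries in the backend" option a bit more concrete, here is a minimal sketch of a dataset that exchanges Arrow tables instead of pandas dataframes, so Polars, DuckDB or pandas can all sit on top of it. Everything in it is an assumption for discussion rather than a proposed API: the class name `IcebergArrowDataset`, its parameters, and the `mode` option are hypothetical. It relies only on PyIceberg's `load_catalog`, `Catalog.load_table`, `Table.scan().to_arrow()`, `Table.append` and `Table.overwrite`, and it assumes the table already exists. A real Kedro dataset would subclass `kedro.io.AbstractDataset`; the base class is left out here so the sketch has no third-party imports at definition time.

```python
from typing import Any, Optional


class IcebergArrowDataset:
    """Hypothetical sketch: load and save an Iceberg table as Arrow data.

    Not part of kedro-datasets. A real implementation would subclass
    kedro.io.AbstractDataset and handle table creation and schema evolution.
    """

    def __init__(
        self,
        table_identifier: str,
        catalog: Optional[Any] = None,
        catalog_name: str = "default",
        catalog_properties: Optional[dict] = None,
        mode: str = "overwrite",
    ) -> None:
        if mode not in ("overwrite", "append"):
            raise ValueError(f"mode must be 'overwrite' or 'append', got {mode!r}")
        self._table_identifier = table_identifier
        # An already-built catalog can be injected (useful for tests);
        # otherwise one is created lazily from name and properties.
        self._catalog = catalog
        self._catalog_name = catalog_name
        self._catalog_properties = dict(catalog_properties or {})
        self._mode = mode

    def _get_catalog(self) -> Any:
        if self._catalog is None:
            # Imported lazily so PyIceberg is only needed when actually used.
            from pyiceberg.catalog import load_catalog

            self._catalog = load_catalog(
                self._catalog_name, **self._catalog_properties
            )
        return self._catalog

    def load(self) -> Any:
        # Full table scan returned as a pyarrow.Table.
        table = self._get_catalog().load_table(self._table_identifier)
        return table.scan().to_arrow()

    def save(self, data: Any) -> None:
        # `data` is expected to be a pyarrow.Table matching the table schema.
        table = self._get_catalog().load_table(self._table_identifier)
        if self._mode == "append":
            table.append(data)
        else:
            table.overwrite(data)

    def _describe(self) -> dict:
        return {
            "table_identifier": self._table_identifier,
            "catalog_name": self._catalog_name,
            "mode": self._mode,
        }
```

With Arrow as the exchange format, a node could convert on either side, for example `polars.from_arrow(dataset.load())` when reading and `df.to_arrow()` when writing. Whether conversion belongs in the node or in per-library dataset subclasses is one of the design questions worth feedback.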
I'd vote for Polars or DuckDB via Ibis
I'd recommend taking a look at https://github.com/dagster-io/community-integrations/tree/main/libraries/dagster-iceberg as a point of comparison (and potential starting point); Dagster I/O managers are fairly analogous to Kedro-Datasets in that they wrap a high-level load and save method, and @JasperHG90 did a great PyIceberg-based implementation. There's also a WIP Spark I/O manager, but that likely should be a Spark dataset, if anything.
It's worth noting that Kedro doesn't have as well-defined a concept of partitioning, so that may not translate without more work.