
kedro-datasets: Add support for Iceberg Tables

Open ankatiyar opened this issue 9 months ago • 2 comments

Description

Follow-up from documenting Kedro + Iceberg in https://github.com/kedro-org/kedro/pull/4521. Add native support for Iceberg tables to kedro-datasets.

Context

We don't have any datasets that support Iceberg tables. The documentation added in https://github.com/kedro-org/kedro/pull/4521 is fairly minimal, and the approach it describes has limitations:

  • Only works for pandas dataframes
  • Works with pyiceberg behind the scenes, which doesn't support the full range of features you can leverage for Iceberg tables
  • Is a custom implementation

I also want to get more feedback from the community about what level of support and which features they would expect from this dataset (or these datasets). I would also like to hear from users about how they currently use Iceberg tables with Kedro.

Possible Implementation

  • Extend the custom example from docs
  • Use other libraries in the backend
  • Spark + Iceberg dataset
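To make the options above easier to discuss, here is a minimal sketch of one possible shape: a single dataset that exposes load and save and delegates to a swappable backend. Everything in it is an assumption for illustration. `IcebergTableDataset`, `TableBackend`, and `InMemoryBackend` are hypothetical names, not existing kedro-datasets classes; the in-memory backend only stands in for a real engine such as PyIceberg or Spark so that the sketch runs without a catalog; and plain lists of dicts stand in for dataframes.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

# Stand-in for a dataframe or Arrow table, so the sketch needs only the stdlib.
Rows = list[dict[str, Any]]


class TableBackend(Protocol):
    """Hypothetical interface that each engine-specific backend would implement."""

    def read(self, table_name: str) -> Rows: ...
    def append(self, table_name: str, rows: Rows) -> None: ...
    def overwrite(self, table_name: str, rows: Rows) -> None: ...


@dataclass
class InMemoryBackend:
    """Toy backend (assumption): replaces a real Iceberg catalog for illustration."""

    tables: dict[str, Rows] = field(default_factory=dict)

    def read(self, table_name: str) -> Rows:
        # Return a copy so callers cannot mutate the stored table.
        return list(self.tables.get(table_name, []))

    def append(self, table_name: str, rows: Rows) -> None:
        self.tables.setdefault(table_name, []).extend(rows)

    def overwrite(self, table_name: str, rows: Rows) -> None:
        self.tables[table_name] = list(rows)


class IcebergTableDataset:
    """Hypothetical dataset mirroring a high-level load/save contract."""

    def __init__(self, table_name: str, backend: TableBackend, mode: str = "overwrite"):
        if mode not in {"append", "overwrite"}:
            raise ValueError(f"Unsupported mode: {mode!r}")
        self._table_name = table_name
        self._backend = backend
        self._mode = mode

    def load(self) -> Rows:
        return self._backend.read(self._table_name)

    def save(self, data: Rows) -> None:
        # The write mode decides whether existing rows are kept or replaced.
        if self._mode == "append":
            self._backend.append(self._table_name, data)
        else:
            self._backend.overwrite(self._table_name, data)
```

Under this sketch, supporting another engine would mean writing another backend rather than another dataset class. Whether that is preferable to one dataset per engine (for example a separate Spark + Iceberg dataset) is exactly the kind of feedback this issue is asking for.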

ankatiyar, Mar 07 '25 10:03

I'd vote for Polars or DuckDB via Ibis.

datajoely, Mar 07 '25 12:03

I'd recommend taking a look at https://github.com/dagster-io/community-integrations/tree/main/libraries/dagster-iceberg as a point of comparison (and potential starting point). Dagster I/O managers are fairly analogous to Kedro-Datasets in that they wrap a high-level load and save method, and @JasperHG90 did a great PyIceberg-based implementation. There's also a WIP Spark I/O manager, but in Kedro that should likely be a Spark dataset, if anything.

It's worth noting that Kedro's concept of partitioning is not as well defined as Dagster's, so that part may not translate without more work.

deepyaman, Mar 25 '25 16:03