feat(datasets): add pandas.DeltaSharingDataset

Open hugodscarvalho opened this issue 1 year ago • 2 comments

Overview

This PR introduces a new dataset, DeltaSharingDataset, which loads data from Delta Sharing shared tables into Pandas DataFrames. Delta Sharing is an open protocol that lets organizations securely exchange large datasets in real time, independent of the computing platforms they use. The dataset is read-only and provides a way to bring Delta Sharing data into Kedro workflows for analysis and processing.

  • Documentation: https://github.com/delta-io/delta-sharing

Features

  • Protocol: The DeltaSharingDataset is built using the Delta Sharing open protocol, enabling secure real-time data exchange.
  • Data Loading: It loads data into a Pandas DataFrame by leveraging the delta_sharing.load_as_pandas function, allowing for easy data manipulation and analysis within Kedro pipelines.
  • Versioning: You can specify a particular version of the dataset or load the latest version by default.
  • Row Limiting: Supports limiting the number of rows loaded for previewing or partial data loading.
  • Delta Format Option: Optionally load data in Delta format by setting the use_delta_format argument.
  • Profile Credentials: Access to Delta Sharing tables is handled via a credentials dictionary, where the path to the Delta Sharing profile must be provided.
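
As a rough illustration of the Data Loading, Versioning, and Row Limiting features above, the sketch below shows how a load could map onto the delta-sharing Python connector. The connector addresses a table with a URL of the form `<profile-file>#<share>.<schema>.<table>`. The helper names `build_table_url` and `load_shared_table` are hypothetical and are not taken from this PR's implementation, and the sketch assumes a connector version whose `load_as_pandas` accepts the `limit`, `version`, and `use_delta_format` keyword arguments.

```python
from typing import Any, Optional


def build_table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build the table URL understood by the delta-sharing connector."""
    return f"{profile_file}#{share}.{schema}.{table}"


def load_shared_table(
    profile_file: str,
    share: str,
    schema: str,
    table: str,
    version: Optional[int] = None,
    limit: Optional[int] = None,
    use_delta_format: bool = False,
) -> Any:
    """Load a shared table into a pandas DataFrame (hypothetical helper)."""
    # Imported lazily so the URL helper works without the connector installed.
    import delta_sharing

    url = build_table_url(profile_file, share, schema, table)
    return delta_sharing.load_as_pandas(
        url, limit=limit, version=version, use_delta_format=use_delta_format
    )


# Only the URL is built here; no network access is needed.
print(
    build_table_url(
        "conf/local/config.share", "example_share", "example_schema", "example_table"
    )
)
```

Calling `load_shared_table(...)` with `version=None` and `limit=None` would request the latest snapshot with no row cap, which matches the defaults described in this PR.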

Example Usage

YAML API:

my_delta_sharing_dataset:
  type: pandas.DeltaSharingDataset
  share: <share-name>
  schema: <schema-name>
  table: <table-name>
  credentials:
    profile_file: <profile-file-path>
  load_args:
    version: <version>
    limit: <limit>
    use_delta_format: <use_delta_format>

Python API:

# Import path assumes the dataset lives under kedro_datasets.pandas, matching the YAML type above.
from kedro_datasets.pandas import DeltaSharingDataset

credentials = {
    "profile_file": "conf/local/config.share"
}
load_args = {
    "version": 1,
    "limit": 10,
    "use_delta_format": True
}

dataset = DeltaSharingDataset(
    share="example_share",
    schema="example_schema",
    table="example_table",
    credentials=credentials,
    load_args=load_args
)
data = dataset.load()
print(data)
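
For readers unfamiliar with the profile file that `credentials.profile_file` points to, here is a minimal sketch of its shape. A Delta Sharing profile is a small JSON document issued by the data provider. The endpoint and token below are placeholders, and the helper `write_profile` is hypothetical and not part of this PR.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values only: a real profile is issued by the data provider.
PROFILE = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token>",
}


def write_profile(directory: Path, name: str = "config.share") -> Path:
    """Write the placeholder profile as JSON and return its path."""
    path = directory / name
    path.write_text(json.dumps(PROFILE, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    profile_path = write_profile(Path(tmp))
    # This dictionary has the shape expected by the `credentials` argument.
    credentials = {"profile_file": str(profile_path)}
    loaded = json.loads(profile_path.read_text())
    print(sorted(loaded))
```

Because the profile contains a bearer token, it is best kept out of version control, for example under `conf/local/` as in the example above.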

Key Configuration Parameters

  • share: The Delta Sharing share name.
  • schema: The schema name within the share.
  • table: The table name to load data from.
  • credentials.profile_file: Path to the Delta Sharing profile file.
  • load_args.version: The version of the table snapshot to load. If not provided, the latest version is loaded.
  • load_args.limit: Maximum number of rows to load. Useful for data previews.
  • load_args.use_delta_format: Whether to use Delta format for loading data. Defaults to False.
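
One way the `load_args` defaults described above could be resolved is sketched below. The names `DEFAULT_LOAD_ARGS` and `resolve_load_args` are hypothetical, and rejecting unknown keys is an assumption made for illustration rather than confirmed behavior of this PR.

```python
from typing import Any, Dict, Optional

# Assumed defaults, mirroring the parameter descriptions above.
DEFAULT_LOAD_ARGS: Dict[str, Any] = {
    "version": None,            # None means "load the latest version"
    "limit": None,              # None means "no row limit"
    "use_delta_format": False,
}


def resolve_load_args(load_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user-supplied load_args over the defaults."""
    supplied = load_args or {}
    unknown = set(supplied) - set(DEFAULT_LOAD_ARGS)
    if unknown:
        # Assumption: fail early on typos such as "limt".
        raise ValueError(f"Unsupported load_args: {sorted(unknown)}")
    return {**DEFAULT_LOAD_ARGS, **supplied}


print(resolve_load_args({"limit": 10}))
```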

Limitations

  • No Save Support: The DeltaSharingDataset is read-only and does not support saving data back to Delta Sharing tables.
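
The read-only behavior can be pictured with the standalone sketch below. It does not use Kedro's base classes: `ReadOnlyTableDataset` and `ReadOnlyDatasetError` are hypothetical stand-ins, and a real Kedro dataset would more likely raise `kedro.io.DatasetError` from its save method.

```python
from typing import Any, Callable


class ReadOnlyDatasetError(Exception):
    """Hypothetical stand-in for Kedro's DatasetError."""


class ReadOnlyTableDataset:
    """Minimal stand-in for a read-only dataset; not the PR's class."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        # The loader is injected so the sketch runs without a sharing server.
        self._loader = loader

    def load(self) -> Any:
        return self._loader()

    def save(self, data: Any) -> None:
        # Saving is rejected outright instead of being silently ignored.
        raise ReadOnlyDatasetError(
            f"{type(self).__name__} is read-only and does not support save."
        )


dataset = ReadOnlyTableDataset(loader=lambda: [{"id": 1}])
rows = dataset.load()
try:
    dataset.save(rows)
except ReadOnlyDatasetError as err:
    print(err)
```

Raising on save surfaces a misconfigured pipeline (a node writing to this catalog entry) at run time instead of dropping the output.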

Impact

This new dataset offers a simple, cost-effective way to incorporate Delta Sharing data into Kedro projects. It is especially useful where shared data is read frequently for analysis: tables load directly into Pandas, so users get Delta Sharing's interoperability without heavy compute resources.

Why Delta Sharing?

  • Interoperability: Delta Sharing allows data sharing between platforms without locking into a specific infrastructure.
  • Cost-Efficiency: With read-only access, it minimizes resource usage by separating data storage and compute resources.
  • Security: Built on a secure, REST-based protocol for trusted data sharing.

By adding this dataset, users can connect to Delta Sharing shared tables and manage large datasets in Pandas for data science tasks, making Kedro more versatile in handling modern data-sharing use cases.

Future Improvements

  • Enhanced dataset operations for Spark DataFrames.

hugodscarvalho, Sep 12 '24 09:09

Hi @hugodscarvalho, thanks a lot for your contribution!

If I may ask: In your understanding, how does Delta Sharing compare to the Iceberg REST API?

astrojuanlu, Oct 21 '24 15:10

@hugodscarvalho Do you still want to finish this PR?

noklam, Nov 10 '24 00:11

Hi @hugodscarvalho, are you still interested in finishing this PR?

merelcht, May 26 '25 11:05

Closing this due to inactivity from the author. @hugodscarvalho or anyone else feel free to re-create this PR if you'd like to continue working on it.

merelcht, Jun 02 '25 07:06