kedro-plugins
feat(datasets): add pandas.DeltaSharingDataset
Overview
This PR introduces a new dataset, DeltaSharingDataset, which loads data from Delta Sharing shared tables into pandas DataFrames. Delta Sharing is an open protocol for securely exchanging large datasets in real time, independent of the computing platforms the parties use. The dataset is read-only and provides a way to bring Delta Sharing data into Kedro workflows for analysis and processing.
- Documentation: https://github.com/delta-io/delta-sharing.
Features
- Protocol: The DeltaSharingDataset is built using the Delta Sharing open protocol, enabling secure real-time data exchange.
- Data Loading: It loads data into a pandas DataFrame by leveraging the delta_sharing.load_as_pandas function, allowing for easy data manipulation and analysis within Kedro pipelines.
- Versioning: You can specify a particular version of the dataset or load the latest version by default.
- Row Limiting: Supports limiting the number of rows loaded for previewing or partial data loading.
- Delta Format Option: Optionally load data in Delta format by setting the use_delta_format argument.
- Profile Credentials: Access to Delta Sharing tables is handled via a credentials dictionary, where the path to the Delta Sharing profile must be provided.
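The loading path described in the features above can be sketched roughly as follows. This is not the PR's implementation: the helper names build_table_url and load_shared_table are hypothetical, and the sketch assumes the delta-sharing Python client's convention of addressing a table as <profile-file>#<share>.<schema>.<table> and that load_args are forwarded unchanged to delta_sharing.load_as_pandas.

```python
from typing import Any, Optional


def build_table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile-file>#<share>.<schema>.<table>' address of a shared table."""
    return f"{profile_file}#{share}.{schema}.{table}"


def load_shared_table(
    profile_file: str,
    share: str,
    schema: str,
    table: str,
    load_args: Optional[dict[str, Any]] = None,
):
    """Load a shared table into a pandas DataFrame (hypothetical helper)."""
    # Imported lazily so build_table_url works without delta-sharing installed.
    import delta_sharing

    url = build_table_url(profile_file, share, schema, table)
    # Assumption: load_args holds keyword arguments such as version, limit
    # and use_delta_format, passed straight through to the client.
    return delta_sharing.load_as_pandas(url, **(load_args or {}))
```

With the values from the Python example in this PR, build_table_url would yield conf/local/config.share#example_share.example_schema.example_table.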
Example Usage
YAML API:
my_delta_sharing_dataset:
  type: pandas.DeltaSharingDataset
  share: <share-name>
  schema: <schema-name>
  table: <table-name>
  credentials:
    profile_file: <profile-file-path>
  load_args:
    version: <version>
    limit: <limit>
    use_delta_format: <use_delta_format>
Python API:
from kedro_datasets.pandas import DeltaSharingDataset

# Path to the Delta Sharing profile file.
credentials = {
    "profile_file": "conf/local/config.share"
}
# Optional arguments controlling what is loaded.
load_args = {
    "version": 1,  # table snapshot version
    "limit": 10,  # maximum number of rows
    "use_delta_format": True
}
dataset = DeltaSharingDataset(
    share="example_share",
    schema="example_schema",
    table="example_table",
    credentials=credentials,
    load_args=load_args
)
# Load the shared table into a pandas DataFrame.
data = dataset.load()
print(data)
Key Configuration Parameters
- share: The Delta Sharing share name.
- schema: The schema name within the share.
- table: The table name to load data from.
- credentials.profile_file: Path to the Delta Sharing profile file.
- load_args.version: The version of the table snapshot to load. If not provided, the latest version is loaded.
- load_args.limit: Maximum number of rows to load. Useful for data previews.
- load_args.use_delta_format: Whether to use Delta format for loading data. Defaults to False.
Limitations
- No Save Support: The DeltaSharingDataset is read-only and does not support saving data back to Delta Sharing tables.
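A read-only dataset of this kind usually rejects writes explicitly rather than ignoring them. The sketch below illustrates that pattern in plain Python so it runs without Kedro installed. The class and exception names are hypothetical stand-ins; an actual Kedro dataset would presumably subclass kedro.io.AbstractDataset and raise Kedro's DatasetError from its save hook instead.

```python
class ReadOnlyDatasetError(Exception):
    """Hypothetical stand-in for the error a Kedro dataset would raise."""


class ReadOnlyDatasetSketch:
    """Minimal read-only dataset: load() delegates, save() always fails."""

    def __init__(self, loader):
        # loader is any zero-argument callable that returns the data,
        # for example a function wrapping delta_sharing.load_as_pandas.
        self._loader = loader

    def load(self):
        return self._loader()

    def save(self, data):
        # Fail loudly so a misconfigured pipeline does not silently drop output.
        raise ReadOnlyDatasetError(
            f"{type(self).__name__} is read-only; saving is not supported."
        )
```

Raising on save makes it obvious when the dataset is mistakenly wired as a node output in a pipeline.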
Impact
This new dataset offers a simple, cost-effective way to incorporate Delta Sharing data into Kedro projects. It is especially useful in environments where shared data is accessed frequently for analysis, enabling users to leverage Delta Sharing's protocol for data interoperability without the need for heavy compute resources.
Why Delta Sharing?
- Interoperability: Delta Sharing allows data sharing between platforms without locking into a specific infrastructure.
- Cost-Efficiency: With read-only access, it minimizes resource usage by separating data storage and compute resources.
- Security: Built on a secure, REST-based protocol for trusted data sharing.
By adding this dataset, users can connect to Delta Sharing shared tables and manage large datasets in Pandas for data science tasks, making Kedro more versatile in handling modern data-sharing use cases.
Future Improvements
- Enhanced dataset operations for Spark DataFrames.
Hi @hugodscarvalho, thanks a lot for your contribution!
If I may ask: In your understanding, how does Delta Sharing compare to the Iceberg REST API?
@hugodscarvalho Do you still want to finish this PR?
Hi @hugodscarvalho are you still interested in finishing this PR?
Closing this due to inactivity from the author. @hugodscarvalho or anyone else feel free to re-create this PR if you'd like to continue working on it.