kedro Access dataset filepath via public API for file-backed datasets

Open ElenaKhaustova opened this issue 8 months ago • 0 comments

Description

Users struggle to access and manage dataset filepaths. AbstractDataset has no mandatory filepath attribute and there is no standard API for accessing metadata, so users cannot reliably obtain a dataset's filepath or tell which dataset version was loaded. APIs also differ across dataset types, which forces users to write custom logic for dataset access and metadata retrieval.

We propose:

Explore the feasibility of implementing a file-backed AbstractDataset and making the filepath attribute mandatory, so that users have a consistent and reliable way to access dataset filepaths.
Develop a standard API for accessing metadata across different dataset types, and decide what the standard metadata should include for each dataset type.
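
To make the proposal more concrete, here is a minimal sketch of what a file-backed base class with a mandatory public filepath and a small metadata method could look like. Everything in it is an assumption for illustration: `FileBackedDataset`, `resolved_filepath`, `metadata` and `TextDataset` are hypothetical names, not existing Kedro API, and the `<filepath>/<version>/<filename>` layout is assumed here to keep the example short.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path, PurePosixPath
from typing import Any, Optional


class FileBackedDataset(ABC):
    """Hypothetical base class for illustration; not part of Kedro's API."""

    def __init__(self, filepath: str, version: Optional[str] = None) -> None:
        if not filepath:
            raise ValueError("filepath is mandatory for file-backed datasets")
        self._filepath = PurePosixPath(filepath)
        self._version = version

    @property
    def filepath(self) -> PurePosixPath:
        """Public, read-only path exactly as declared by the user."""
        return self._filepath

    def resolved_filepath(self) -> PurePosixPath:
        """Path that is actually read or written, including the version."""
        if self._version is None:
            return self._filepath
        # Assumed layout: <filepath>/<version>/<filename>
        return self._filepath / self._version / self._filepath.name

    def metadata(self) -> dict[str, Any]:
        """One possible 'standard metadata' payload (an assumption)."""
        return {
            "type": type(self).__name__,
            "filepath": str(self.filepath),
            "version": self._version,
            "resolved_filepath": str(self.resolved_filepath()),
        }

    @abstractmethod
    def load(self) -> Any: ...

    @abstractmethod
    def save(self, data: Any) -> None: ...


class TextDataset(FileBackedDataset):
    """Tiny concrete dataset used only to exercise the sketch."""

    def load(self) -> str:
        return Path(self.resolved_filepath()).read_text()

    def save(self, data: str) -> None:
        target = Path(self.resolved_filepath())
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(data)
```

With a contract like this, calling code could read `dataset.filepath` or `dataset.metadata()` without knowing the concrete dataset type. Whether the version belongs in the standard metadata is one of the open questions the proposal raises.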

Relates to https://github.com/kedro-org/kedro/issues/1936

Context

The APIs of AbstractVersionedDataset and AbstractDataset are inconsistent, since only the versioned one has a filepath attribute: "It's kind of weird that when I switch from AbstractDataset to the AbstractVersionedDataset, suddenly the file path appears at that point. Like that feels quite weird to me that doesn't feel right."
Users have to take into account the dataset type to be able to get the filepath:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48
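
The kind of type-dependent lookup described here can be sketched as follows. This is not the code from the linked kedro-mlflow file. It is a hypothetical helper (`guess_filepath`) with stand-in classes, and the private attribute names it probes are assumptions about common conventions.

```python
from pathlib import PurePosixPath
from typing import Any, Optional

# Private attribute names a caller might have to probe (assumed conventions).
_CANDIDATE_ATTRS = ("_filepath", "_file_path", "_path")


def guess_filepath(dataset: Any) -> Optional[PurePosixPath]:
    """Return the first private path-like attribute found, or None."""
    for attr in _CANDIDATE_ATTRS:
        value = getattr(dataset, attr, None)
        if value is not None:
            return PurePosixPath(str(value))
    return None


class VersionedLike:
    """Stand-in for a dataset that stores its path in `_filepath`."""

    def __init__(self, filepath: str) -> None:
        self._filepath = PurePosixPath(filepath)


class FolderLike:
    """Stand-in for a dataset that stores its path in `_path`."""

    def __init__(self, path: str) -> None:
        self._path = path


class NotFileBacked:
    """Stand-in for a dataset with no path at all (for example, in memory)."""
```

A helper like this breaks silently whenever a dataset renames its private attribute, which is the fragility that a public filepath property would remove.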

It is hard to get the filepath and to understand which dataset version was loaded: "It's crazy confusing to actually get the correct file path"
Users want a standardised API for accessing different types of datasets, for example file-backed ones. They want to rely on an API contract when using DataCatalog and datasets, but following one is not mandatory at the moment: "We have MlflowArtifactDataset which is a wrapper for any AbstractDataset which logs the dataset automatically in mlflow as an artifact when its save method is called. The lack of a formal AbstractDataset API for file paths leads to inconsistencies, relying heavily on the convention that file paths are included as a hidden property _file_path in the dataset’s implementation. Formalizing this attribute as a public property would enhance reliability and convenience across Kedro’s framework. Otherwise, the potential difficulties in maintaining this might arise with community-maintained or experimental datasets, as it would be super hard to enforce that"

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html
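
To illustrate why a formal public property would help wrapper datasets like the one quoted, here is a rough sketch of a wrapper that logs an artifact after every save. It does not reproduce MlflowArtifactDataset. `HasFilepath`, `ArtifactLoggingDataset` and `FakeCsvDataset` are hypothetical, and the `log_artifact` callback stands in for whatever tracking call a real wrapper would make.

```python
from typing import Any, Callable, List, Protocol, runtime_checkable


@runtime_checkable
class HasFilepath(Protocol):
    """Hypothetical contract: a public `filepath` plus a `save` method."""

    @property
    def filepath(self) -> str: ...

    def save(self, data: Any) -> None: ...


class ArtifactLoggingDataset:
    """Wraps a dataset and reports its filepath after each save."""

    def __init__(
        self, dataset: HasFilepath, log_artifact: Callable[[str], None]
    ) -> None:
        # Fail early instead of probing private attributes later on.
        if not isinstance(dataset, HasFilepath):
            raise TypeError("wrapped dataset must expose a public `filepath`")
        self._dataset = dataset
        self._log_artifact = log_artifact

    def save(self, data: Any) -> None:
        self._dataset.save(data)
        self._log_artifact(str(self._dataset.filepath))


class FakeCsvDataset:
    """In-memory stand-in for a file-backed dataset."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath
        self.saved: List[Any] = []

    def save(self, data: Any) -> None:
        self.saved.append(data)
```

Because the wrapper depends only on the public contract, it would also work with community-maintained datasets, provided they implement the same property.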

Users find it challenging to access critical metadata such as file paths directly through the public API, which often requires delving into less transparent, potentially private API elements. This adds complexity to what could otherwise be straightforward data management tasks.

Jun 05 '24 15:06 ElenaKhaustova
