Access dataset filepath via public API for file-backed datasets
Description
Users struggle to access and manage dataset filepaths. AbstractDataset has no mandatory filepath attribute, and there is no standard API for accessing metadata, so users cannot reliably get a dataset's filepath or tell which dataset version was loaded. Inconsistent APIs across dataset types make this worse: users have to write custom logic for dataset access and metadata retrieval.
We propose:
- Explore the feasibility of implementing a file-backed AbstractDataset and making the filepath attribute mandatory, to give users a consistent and reliable way to access dataset filepaths.
- Develop a standard API for accessing metadata across different dataset types, and decide what the standard metadata should include for each dataset type.
Relates to https://github.com/kedro-org/kedro/issues/1936
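As a rough illustration of the first proposal, the sketch below shows what a file-backed base class with a mandatory, public filepath could look like. It is deliberately self-contained and does not import Kedro: `FileBackedDataset`, `TextFileDataset`, and the `metadata()` method are hypothetical names invented for this example, not existing Kedro APIs, and the metadata keys are only a placeholder for whatever the standard ends up including.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class FileBackedDataset(ABC):
    """Hypothetical base class for datasets stored in a file.

    Assumption: not part of Kedro. It only illustrates making
    ``filepath`` a required, publicly readable attribute.
    """

    def __init__(self, filepath: str) -> None:
        # Reject a missing path up front so every instance has one.
        if not filepath:
            raise ValueError("filepath is required for file-backed datasets")
        self._filepath = filepath

    @property
    def filepath(self) -> str:
        """Public, read-only access to the path the dataset points at."""
        return self._filepath

    def metadata(self) -> Dict[str, Any]:
        """Hypothetical standard metadata; the real contents are undecided."""
        return {"type": type(self).__name__, "filepath": self.filepath}

    @abstractmethod
    def load(self) -> Any:
        """Read the data from ``filepath``."""

    @abstractmethod
    def save(self, data: Any) -> None:
        """Write the data to ``filepath``."""


class TextFileDataset(FileBackedDataset):
    """Minimal concrete dataset used only to exercise the base class."""

    def load(self) -> str:
        with open(self.filepath, encoding="utf-8") as handle:
            return handle.read()

    def save(self, data: str) -> None:
        with open(self.filepath, "w", encoding="utf-8") as handle:
            handle.write(data)
```

With something along these lines, calling code could read `dataset.filepath` on any file-backed dataset without knowing its concrete type, and a dataset without a path could not be constructed at all.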
Context
- Inconsistency of APIs between AbstractVersionedDataset and AbstractDataset; one has a filepath attribute: "It's kind of weird that when I switch from AbstractDataset to the AbstractVersionedDataset, suddenly the file path appears at that point. Like that feels quite weird to me that doesn't feel right."
- Users have to take into account the dataset type to be able to get the filepath: https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48
- It is hard to get the filepath and to understand which dataset version was loaded: "It's crazy confusing to actually get the correct file path"
- Users want a standardised API to access different types of datasets, for example file-backed ones. They want to rely on an API when using DataCatalog/Datasets, but following one is not mandatory now: "We have MlflowArtifactDataset which is a wrapper for any AbstractDataset which logs the dataset automatically in mlflow as an artifact when its save method is called. The lack of a formal AbstractDataset API for file paths leads to inconsistencies, relying heavily on the convention that file paths are included as a hidden property _file_path in the dataset’s implementation. Formalizing this attribute as a public property would enhance reliability and convenience across Kedro’s framework. Otherwise, the potential difficulties in maintaining this might arise with community-maintained or experimental datasets, as it would be super hard to enforce that" https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html
- Users find it challenging to access critical metadata such as file paths directly through the public API, which often requires delving into less transparent, potentially private API elements. This adds complexity to what could otherwise be straightforward data management tasks.
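To make the pain point above concrete, here is a minimal sketch of the kind of custom lookup that wrappers end up writing when there is no public filepath API. It does not import Kedro: the stand-in dataset classes and the `resolve_filepath` helper are hypothetical, and the list of candidate attribute names is an assumption for illustration, since which attribute a given dataset uses is exactly what is not guaranteed.

```python
from typing import Any, Optional

# Assumed candidate names, checked in order; a real dataset may use none of them.
CANDIDATE_ATTRS = ("filepath", "_filepath", "_file_path")


def resolve_filepath(dataset: Any) -> Optional[str]:
    """Best-effort lookup of a dataset's path by probing known attribute names.

    Returns None when no candidate attribute is set, which the caller
    then has to handle as a separate case.
    """
    for name in CANDIDATE_ATTRS:
        value = getattr(dataset, name, None)
        if value is not None:
            return str(value)
    return None


class PublicPathDataset:
    """Stand-in for a dataset that exposes its path publicly."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath


class PrivatePathDataset:
    """Stand-in for a dataset that keeps its path in a private attribute."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


class NoPathDataset:
    """Stand-in for a dataset with no file behind it."""
```

The helper works only as long as every dataset follows one of the guessed conventions, and it reaches into private attributes to do so. A public, mandatory filepath on file-backed datasets would reduce it to a single attribute access.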