[DataCatalog]: Catalog serialization and deserialization support
Description
- Users point out the lack of persistence in the add workflow, as there is no built-in functionality to save modified catalogs.
- Users express the need for an API to save and load catalogs after compilation or modification by converting catalogs to YAML format and back.
- Users encounter difficulties loading pickled `DataCatalog` objects when the Kedro version changes between saving and loading, leading to compatibility issues. They require a solution to serialize and deserialize the `DataCatalog` object without dependency on Kedro versions.

We propose to explore the feasibility of implementing `to_yaml()` and `from_yaml()` methods for the `DataCatalog` object to facilitate serialization and deserialization without dependency on Kedro versions.
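A minimal sketch of how such methods could fit the notebook "add" workflow described below. Note that `to_yaml()` and `from_yaml()` do not exist yet and the names are still open (see the naming discussion in the pseudo code in the Context section); the dataset used here assumes `kedro-datasets` is installed.

```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset  # assumes kedro-datasets is installed

# Start from a catalog (e.g. the one provided in a notebook session) and add an
# extra dataset interactively - the "add workflow" described in the feedback below.
catalog = DataCatalog()
catalog.add("companies", CSVDataset(filepath="data/01_raw/companies.csv"))

# Hypothetical: write the current catalog definition out as YAML so the added
# entries are not lost when the notebook is closed.
catalog.to_yaml("conf/base/catalog_compiled.yml")

# Hypothetical: rebuild the same catalog later, independent of the Kedro
# version that wrote the file.
catalog = DataCatalog.from_yaml("conf/base/catalog_compiled.yml")
```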
Context
User feedback:
- Add workflow is missing persistence, so you cannot save a modified catalog: "You have a catalog and then you start adding extra stuff to it, currently we just throw away those added things when they close a notebook."
- Catalog-to-YAML function is needed to save a modified catalog: "People have always asked for it. Could I have a catalog to YAML function so that you could actually spit out the YAML files that are needed to do this again later on?"
- Competitors provide the functionality to compile a catalog and showcase the result: "I would point to the dbt compile workflow. And actually, if you do dbt run it does dbt compile first and then runs the compiled outputs. Whereas in Kedro, you have your very concise, complicated YAML and all that compilation happens at run time and there's no way for the user to see it."
- When pickling a `DataCatalog` object, they experience difficulties loading it back if the Kedro version is different: "Serialization is an issue because I often pickle a catalog (mostly as part of an mlflow model). Pickling the catalog is really something that leads to a lot of problems because if I don't have the exact same Kedro version when I want to load the catalog, if the object has any change inside - private method or attribute - it will lead to error."
https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143
```python
# pseudo code
import pickle

pickled_catalog = pickle.dumps(catalog)
pickle.loads(pickled_catalog)  # this will fail if I reload with a newer Kedro version
                               # and any attribute (even private) has changed.
                               # This breaks much more often than we should expect.
```
"It would be much more robust to be able to do this":
```python
# pseudo code
catalog.serialize("path/catalog.yml")  # name TBD: serialize? to_config? to_yaml? to_json? to_dict?
catalog = DataCatalog.deserialize("path/catalog.yml")  # much more robust since it is not stored as a python object -> maybe catalog.from_config?
```
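For comparison, a rough sketch of what a config-based round trip could look like. The loading half leans on the existing `DataCatalog.from_config()` entry point; the `to_config()` method used here is hypothetical and stands in for whatever export API is eventually chosen.

```python
import yaml

from kedro.io import DataCatalog

# Hypothetical: export the catalog as a plain dict of dataset configurations
# (no Python objects involved, so no coupling to Kedro internals).
config = catalog.to_config()

with open("path/catalog.yml", "w") as f:
    yaml.safe_dump(config, f)

# Later, possibly with a different Kedro version installed: reload the YAML
# and rebuild the catalog from config, which is an existing entry point.
with open("path/catalog.yml") as f:
    config = yaml.safe_load(f)

catalog = DataCatalog.from_config(config)
```

Because only plain configuration is persisted, compatibility depends on the catalog config format rather than on Kedro's internal attributes, which is exactly the robustness the feedback above asks for.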