
[KED-1426] Handling of out of catalog items

Open nikos-kal opened this issue 4 years ago • 2 comments

Description

Airflow requires that data is persisted between nodes; out-of-catalog entries and MemoryDatasets are therefore not supported.

Context

Converting all I/O to persisted datasets can be a long and tedious process (a catalog entry for every single parameter, etc.). An item the user chose not to persist between nodes is likely a minor parameter or a small, easy-to-compute table. Given that, it would be great if the Airflow plug-in offered the option of default automatic handling of such out-of-catalog data.

Possible Implementation

Introduce a toggle that enables automatic handling of such data, together with a default dataset class for storage (since most Python objects can be pickled, PickleDataset would fit here). The user can provide a different default dataset class or switch the functionality off entirely.

Example for inspiration:

```python
auto_out_of_catalog_store = True
out_of_catalog_root = "data/airflow_temp/"
out_of_catalog_default_dataset = PickleLocalDataset
```
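A minimal sketch of how those settings could be applied is below. Everything here is hypothetical: `LocalPickleDataset` is a stand-in for Kedro's pickle-backed dataset, and `fill_out_of_catalog` is an illustrative helper, not part of the kedro-airflow API. The idea is simply that any dataset name missing from the catalog gets a persisted pickle-backed default rooted at `out_of_catalog_root`.

```python
import pickle
from pathlib import Path


class LocalPickleDataset:
    """Minimal stand-in for a pickle-backed Kedro dataset:
    persists any picklable object to a local file."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def save(self, data):
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("wb") as f:
            pickle.dump(data, f)

    def load(self):
        with self._filepath.open("rb") as f:
            return pickle.load(f)


def fill_out_of_catalog(catalog, dataset_names,
                        root="data/airflow_temp",
                        default_dataset=LocalPickleDataset,
                        auto_store=True):
    """Register a persisted default dataset for every name in
    `dataset_names` that is missing from `catalog` (mirrors the
    proposed auto_out_of_catalog_store / out_of_catalog_root /
    out_of_catalog_default_dataset settings)."""
    if not auto_store:
        return catalog
    for name in dataset_names:
        if name not in catalog:
            catalog[name] = default_dataset(Path(root) / f"{name}.pkl")
    return catalog
```

With this in place, a converter could call `fill_out_of_catalog(catalog, pipeline_dataset_names)` before generating the DAG, so every node boundary has a persisted dataset behind it.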

nikos-kal avatar Feb 24 '20 14:02 nikos-kal

@nikos-kal

Thank you for the feedback! We were aware of this issue and are currently discussing which component should be responsible for it (the plugin, the catalog class, the runner class, the user's config, etc.). We will update you once we have decided on the tech design :)

921kiyo avatar Feb 24 '20 14:02 921kiyo

Hello kedro-airflow team, I am pretty new to kedro and especially to airflow. I wanted to ask: does the problem with MemoryDatasets still exist when converting a kedro project to an Airflow DAG? Although the referenced issue quantumblacklabs/kedro#501 above is closed, it seems it did not address this problem, right?

Thanks for your help!

alextsakpinis avatar Dec 09 '21 18:12 alextsakpinis