kedro-plugins
Introduce KedroOperator for Airflow
Discussed in https://github.com/kedro-org/kedro/discussions/1716
Originally posted by kangshung July 20, 2022 Hey, are there any plans to add KedroOperator to an official Airflow provider list? This would really make this operator "official". KedroSession, if needed, could probably be passed in a different way.
Let's discuss this as part of Technical Design and talk about what the plan is for kedro-airflow and how the operator would fit in.
Technical design discussion 30.11.2022:
- Do we want to be an official Airflow provider?
We should first figure out the direction of Kedro's Airflow support and defer the answer to this question until then. @idanov to create a new issue related to this.
Pros:
- Increased visibility for Kedro.
- Long term it is the right move.
Cons:
- Effort to port + maintain.
- Have to support multiple repositories.
- What changes would we need to make to `kedro-airflow`? Currently `kedro-airflow/plugin.py` creates a `*_dag.py` file containing `KedroOperator` (one per Pipeline) using a Jinja2 template. The resulting DAG file is imported into Airflow, where it can then be run. This differs from how the code of other Airflow providers is structured.
- Create a new module for `KedroOperator`
- Port from Jinja2 format to a regular Python module
To move this to airflow.providers, we would have to move the implementation of KedroOperator to the Airflow repository.
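As a rough illustration of the port described above, here is a minimal sketch of `KedroOperator` as a regular Python module rather than a Jinja2-rendered snippet. The constructor parameters are assumptions modelled on what the generated DAG passes in, not a confirmed provider API. The fallback `BaseOperator` is a stand-in added only so the sketch can be imported without Airflow installed. The Kedro imports are deferred to `execute` so the module can be parsed where Kedro is not installed.

```python
"""Hypothetical standalone module for KedroOperator (a sketch, not the official API)."""
from __future__ import annotations

from pathlib import Path
from typing import Any

try:
    from airflow.models import BaseOperator
except ImportError:
    # Stand-in used only when Airflow is not installed; NOT the real class.
    class BaseOperator:  # type: ignore[no-redef]
        def __init__(self, task_id: str, **kwargs: Any) -> None:
            self.task_id = task_id


class KedroOperator(BaseOperator):
    """Runs a single Kedro node inside an Airflow task."""

    def __init__(
        self,
        *,
        package_name: str,
        pipeline_name: str,
        node_name: str,
        project_path: str | Path,
        env: str,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context: Any) -> None:
        # Imported lazily: Kedro is only needed when the task actually runs.
        from kedro.framework.project import configure_project
        from kedro.framework.session import KedroSession

        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])
```

With a module like this, the generated `*_dag.py` file would only need to import the operator and wire up tasks, instead of embedding the class definition in every DAG.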
Questions:
- Can we move only the `KedroOperator`, but not the other hooks implementation? What are the requirements to become an official provider?
I'd like to know @marrrcin's and @sbrugman's opinions here. From https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/ for example:
I quickly established my opinion about the quick start setup - the example given there is unpractical, as it is flawed in a few ways that I'd like to avoid in my solution:
- First, it assumes that Airflow and Kedro know about each other. I would prefer to isolate these two environments so that I don't need to import Kedro in Airflow or Airflow in Kedro. As managing dependencies in Airflow is challenging, it would be better to avoid this problem altogether.
- From the above it seems that both would have to have similar needs regarding the machine specifications they run on, as they would be executed in the same environment.
- Thirdly, as the code would be executed by the same processes, it would need to be shared in the form of packages. In this setup Airflow runs in a docker image, so then I'd have to either re-build and re-run this image every time either the Airflow or Kedro project code changes, OR additionally manage lots of virtual python environments somewhere and ship the new versions of the micro-packaged Kedro pipelines there whenever the code changes.
On the other hand, as a non-k8s expert I'd be rather bummed if I had to always use Kubernetes to deploy Kedro on Airflow, and as such I understand that kedro-airflow provides a simpler experience. I know that @sbrugman uses it a lot.
kedro-airflow maintenance at the moment isn't great, and its docs aren't very informative (https://github.com/kedro-org/kedro-plugins/issues/394). I think getting more people to use it (including ourselves) is a prerequisite for this.
Yeah, what we've experienced is that the current docs for kedro-airflow somewhat neglect the "heavy lifting" part, especially when it comes to the Airflow setup (I know that it's not Kedro's responsibility to explain how to manage Airflow). Maybe it would be a good idea to have a warning at the top saying that deploying Kedro on Airflow requires some Airflow knowledge anyway, and that the quickstart is a quickstart of "Kedro on Airflow", not "Kedro plus Airflow" 🤔
As @jmholzer said in https://github.com/kedro-org/kedro-plugins/issues/482#issuecomment-1854424050,
https://github.com/kedro-org/kedro-plugins/blob/552b973a256c0f4a9f96e36feb70f4fc15fb371b/kedro-airflow/kedro_airflow/airflow_dag_template.j2#L14-L39
would need to be in its own Python file.
To add a bit more context, kedro-airflow still enjoys a large number of downloads:
│ kedro-vertexai ┆ 29211 │
│ kedro-argo ┆ 29304 │
│ kedro-airflow-k8s ┆ 36430 │
│ kedro-kubeflow ┆ 39011 │
│ kedro-static-viz ┆ 42838 │
│ kedro-azureml ┆ 50521 │
│ kedro-wings ┆ 51509 │
│ kedro-great ┆ 64275 │
│ kedro-airflow ┆ 76939 │
│ kedro-neptune ┆ 94213 │
│ kedro-docker ┆ 155405 │
│ kedro-mlflow ┆ 401860 │
│ kedro-telemetry ┆ 1798656 │
│ kedro-datasets ┆ 2173063 │
│ kedro-viz ┆ 4022666 │
│ kedro ┆ 17221746 │
└──────────────────────────┴──────────┘