kedro-plugins
kedro-airflow: updating support for Kubernetes
Description
To facilitate running Kedro Airflow on Kubernetes, the kedro-airflow-k8s plugin was developed. However, it only supports versions of Kedro up to 0.18.0, while the current version is 0.19.4. Consequently, we have moved the recommendation to use this plugin to the end of our airflow deployment documentation. We now need to determine the best approach for using Kedro Airflow on Kubernetes going forward.
@lasica @marrrcin Any thoughts? Are you accepting PRs on getindata/kedro-airflow-k8s?
You can use the official one and run on k8s. See https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/
As I understand it:
If I have a Kubernetes cluster, I can deploy Airflow there using Helm and customise the deployment with a values.yaml file and a custom Docker image to run my Kedro project's DAG. The process involves:
- Replacing MemoryDatasets with persistent datasets, or grouping nodes.
- Setting environment variables.
- Manually copying my DAG (sketched below) to the Airflow scheduler pod, and copying my project's config and package files into the Docker build folder.
- Creating a custom Dockerfile with my Kedro package installation command.
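For concreteness, here is a hypothetical sketch of the kind of per-node DAG that `kedro airflow create` produces and that would be copied to the scheduler. The package name, node names and paths are placeholders, and the real generated template differs in detail:

```python
# Hypothetical sketch of a DAG in the style generated by kedro-airflow:
# one Airflow task per Kedro node, each running inside the Airflow worker.
# Package name, node names and project_path are placeholders.
from datetime import datetime

from airflow.models import DAG, BaseOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


class KedroOperator(BaseOperator):
    """Runs the given Kedro node via a KedroSession inside the Airflow worker."""

    def __init__(self, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        bootstrap_project(self.project_path)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


with DAG(dag_id="my-kedro-project", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    preprocess = KedroOperator(
        task_id="preprocess-companies-node",
        pipeline_name="__default__",
        node_name="preprocess_companies_node",
        project_path="/opt/airflow/dags/my_kedro_project",
        env="airflow",
    )
    train = KedroOperator(
        task_id="train-model-node",
        pipeline_name="__default__",
        node_name="train_model_node",
        project_path="/opt/airflow/dags/my_kedro_project",
        env="airflow",
    )
    preprocess >> train
```

Every task here runs in the same Airflow worker environment, which is why the project package and config have to be baked into the image the workers use.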
So technically, I don't need anything special to run Kedro on Airflow deployed on a Kubernetes cluster; it's enough to use a DAG created by the kedro-airflow plugin. However, this setup only allows me to run one Kedro project per Airflow deployment. If I want to run multiple projects in the same Airflow deployment, I can use the KubernetesPodOperator() for each Airflow task (i.e., Kedro node). This will execute each task in an isolated, customised container in a separate Kubernetes pod, with the KubernetesExecutor dynamically managing all these pods.
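A minimal sketch of that idea, assuming each project is packaged into its own Docker image; the image name, namespace and node names are placeholders, not something the current kedro-airflow template emits:

```python
# Hypothetical sketch: one KubernetesPodOperator per Kedro node, so each task
# runs `kedro run --nodes=<node>` in its own pod built from the project's image.
# In older cncf.kubernetes provider releases the import path is
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead.
from datetime import datetime

from airflow.models import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

IMAGE = "registry.example.com/my-kedro-project:latest"  # placeholder image

with DAG(dag_id="my-kedro-project-k8s", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    preprocess = KubernetesPodOperator(
        task_id="preprocess-companies-node",
        name="preprocess-companies-node",
        namespace="airflow",
        image=IMAGE,
        cmds=["kedro"],
        arguments=["run", "--nodes=preprocess_companies_node"],
        get_logs=True,
    )
    train = KubernetesPodOperator(
        task_id="train-model-node",
        name="train-model-node",
        namespace="airflow",
        image=IMAGE,
        cmds=["kedro"],
        arguments=["run", "--nodes=train_model_node"],
        get_logs=True,
    )
    preprocess >> train
```

Because every task is a fresh pod, any dataset that crosses a task boundary has to be persistent rather than a MemoryDataset.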
However, this approach might be inefficient if there are many Kedro nodes, as it will require deploying many containers. It's better to group nodes to reduce the number of tasks, and thus the number of pods. If I understood correctly, additional functionality in the kedro-airflow plugin to help modify your DAG by inserting the KubernetesPodOperator() and KubernetesExecutor parts would be beneficial.
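A hedged sketch of what that grouping could look like if the plugin (or a hand-edited DAG) emitted one pod per group rather than per node; the group and node names are invented for illustration:

```python
# Hypothetical node grouping: several Kedro nodes run inside a single pod by
# passing a comma-separated --nodes list, so N nodes collapse into M tasks/pods.
from datetime import datetime

from airflow.models import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

IMAGE = "registry.example.com/my-kedro-project:latest"  # placeholder image

with DAG(dag_id="my-kedro-project-grouped", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    preprocessing = KubernetesPodOperator(
        task_id="preprocessing",
        name="preprocessing",
        namespace="airflow",
        image=IMAGE,
        cmds=["kedro"],
        # Datasets that only flow between nodes inside the group can stay in
        # memory, since the whole group runs in one process in one pod.
        arguments=["run", "--nodes=preprocess_companies_node,preprocess_shuttles_node,create_model_input_table_node"],
        get_logs=True,
    )
    training = KubernetesPodOperator(
        task_id="training",
        name="training",
        namespace="airflow",
        image=IMAGE,
        cmds=["kedro"],
        arguments=["run", "--nodes=split_data_node,train_model_node,evaluate_model_node"],
        get_logs=True,
    )
    preprocessing >> training
```

Only the datasets that cross a group boundary (here, whatever the preprocessing group hands to training) would need to be persisted in the catalog.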
Do you have the same opinion, @marrrcin? Is using the KubernetesPodOperator() for each task a good solution?
Hi, so the solution I've linked above (https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/) does exactly that: it either runs N:N <Kedro nodes>:<pod for each node> or, with grouping, N:M <Kedro nodes>:<pod for each group>. It also lets you use the same Airflow deployment and run multiple Kedro projects within the same instance, with full isolation. Imho that's the best approach here. I would say that the default template should encourage using KubernetesPodOperator.
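To illustrate the multi-project isolation point (all names below are placeholders, not anything the plugin generates): each project gets its own DAG file pointing at its own image, so the projects share the Airflow instance but nothing else.

```python
# Hypothetical second DAG living in the same Airflow instance: full isolation
# comes from each project pulling its own image, not from the scheduler.
from datetime import datetime

from airflow.models import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="another-kedro-project", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    run_all = KubernetesPodOperator(
        task_id="run-default-pipeline",
        name="run-default-pipeline",
        namespace="airflow",
        image="registry.example.com/another-kedro-project:latest",  # placeholder
        cmds=["kedro"],
        arguments=["run", "--pipeline=__default__"],
        get_logs=True,
    )
```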