kedro-airflow: updating support for Kubernetes

DimedS opened this issue 10 months ago • 4 comments

Description

To facilitate running Kedro pipelines on Airflow on Kubernetes, the kedro-airflow-k8s plugin was developed. However, it only supports Kedro versions up to 0.18.0, while the current version is 0.19.4. Consequently, we have moved the recommendation to use this plugin to the end of our Airflow deployment documentation. We now need to determine the best approach for running Kedro pipelines on Airflow on Kubernetes going forward.

DimedS avatar Apr 15 '24 10:04 DimedS

@lasica @marrrcin Any thoughts? Are you accepting PRs on getindata/kedro-airflow-k8s?

astrojuanlu avatar Apr 15 '24 13:04 astrojuanlu

You can use the official plugin and run it on k8s. See https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/

marrrcin avatar Apr 16 '24 07:04 marrrcin

As I understand:

If I have a Kubernetes Cluster, I can deploy Airflow there using Helm and customise the deployment with a values.yaml file and a custom Docker image to run my Kedro project's DAG. The process involves:

  • Replacing MemoryDatasets with persistent datasets, or grouping nodes so that intermediate data stays within one task.
  • Setting environment variables.
  • Manually copying my DAG to the Airflow Scheduler Pod, and copying my project's config and package files to the Docker build folder (a sketch of such a DAG follows this list).
  • Creating a custom Dockerfile with my Kedro package installation command.
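
For reference, here is a trimmed, illustrative sketch of the kind of DAG file that kedro-airflow generates; the real template varies between plugin and Kedro versions, and all names below are indicative only:

```python
# Trimmed sketch of a kedro-airflow-generated DAG (template varies by version).
from datetime import datetime

from airflow.models import DAG, BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


class KedroOperator(BaseOperator):
    """Runs a single Kedro node inside the Airflow worker process."""

    def __init__(self, *, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


# `schedule` is the Airflow 2.4+ spelling; older versions use `schedule_interval`.
with DAG(dag_id="my-kedro-project", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    KedroOperator(
        task_id="preprocess-companies-node",  # one Airflow task per Kedro node
        package_name="my_kedro_project",  # hypothetical package name
        pipeline_name="__default__",
        node_name="preprocess_companies_node",
        project_path="/opt/airflow/my-kedro-project",  # hypothetical path
        env="airflow",
    )
```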

So technically, I don't need anything special to run Kedro on Airflow deployed on a Kubernetes cluster; it's enough to use a DAG created by the kedro-airflow plugin. However, this setup only allows me to run one Kedro project per Airflow deployment.

If I want to run multiple projects in the same Airflow deployment, I can use the KubernetesPodOperator() for each Airflow task (i.e., Kedro node). This executes each task in an isolated, customised container in a separate Kubernetes Pod, with the KubernetesExecutor dynamically managing all these pods, as in the sketch below.
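
A minimal sketch of that per-node setup, assuming the cncf-kubernetes provider and a hypothetical project image (import path, image, and node names are placeholders):

```python
# Hypothetical sketch: one Kubernetes pod per Kedro node.
from datetime import datetime

from airflow.models import DAG
# Older provider versions import from ...operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="my-kedro-project-k8s", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    tasks = {}
    for node_name in ["preprocess_companies_node", "preprocess_shuttles_node"]:
        tasks[node_name] = KubernetesPodOperator(
            task_id=node_name.replace("_", "-"),
            name=node_name.replace("_", "-"),
            namespace="airflow",
            image="registry.example.com/my-kedro-project:latest",  # hypothetical image
            cmds=["kedro"],
            arguments=["run", "--nodes", node_name],  # run exactly one node in this pod
            get_logs=True,
        )
    # Task dependencies would then mirror the Kedro pipeline topology, e.g.
    # tasks["upstream_node"] >> tasks["downstream_node"]
```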

However, this approach might be inefficient if there are many Kedro nodes, as it will require deploying many containers. It's better to group nodes to reduce the number of tasks, and thus the number of pods, for example as in the sketch below. If I understood correctly, additional functionality in the kedro-airflow plugin to help modify your DAG by inserting the KubernetesPodOperator() and KubernetesExecutor parts would be beneficial.
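
A sketch of such grouping, with hypothetical group names and node names; each pod then executes a single `kedro run` over a whole group of nodes:

```python
# Hypothetical sketch: one pod per *group* of Kedro nodes (N:M instead of N:N).
from datetime import datetime

from airflow.models import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical grouping; in practice groups could be derived from tags or namespaces.
node_groups = {
    "preprocessing": ["preprocess_companies_node", "preprocess_shuttles_node"],
    "modelling": ["split_data_node", "train_model_node", "evaluate_model_node"],
}

with DAG(dag_id="my-kedro-project-grouped", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    groups = {}
    for group_name, node_names in node_groups.items():
        groups[group_name] = KubernetesPodOperator(
            task_id=group_name,
            name=group_name,
            namespace="airflow",
            image="registry.example.com/my-kedro-project:latest",  # hypothetical image
            cmds=["kedro"],
            # One `kedro run` covers the whole group, so datasets that stay inside
            # a group can remain MemoryDatasets; only data crossing a group
            # boundary needs a persistent dataset.
            arguments=["run", "--nodes", ",".join(node_names)],
        )
    groups["preprocessing"] >> groups["modelling"]
```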

Do you have the same opinion, @marrrcin? Is using the KubernetesPodOperator() for each task a good solution?

DimedS avatar May 17 '24 13:05 DimedS

Hi, so the solution I've linked above (https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/) does exactly that: it either runs N:N (<Kedro nodes>:<pod for each node>) or, with grouping, N:M (<Kedro nodes>:<pod for each group>). It also allows you to use the same Airflow deployment and run multiple Kedro projects within the same instance with full isolation, e.g. as in the sketch below. IMHO that's the best approach here, and I would say that the default template should encourage using KubernetesPodOperator.
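
To illustrate the isolation point, a minimal sketch under the assumption that each project ships its own image (project names, images, and registry are hypothetical; Airflow 2.4+ auto-registers DAGs created in a `with` block):

```python
# Hypothetical sketch: two Kedro projects in one Airflow deployment,
# fully isolated because each runs in its own image.
from datetime import datetime

from airflow.models import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

for project, image in [
    ("project-a", "registry.example.com/project-a:1.2.0"),
    ("project-b", "registry.example.com/project-b:0.4.1"),
]:
    with DAG(dag_id=project, start_date=datetime(2024, 1, 1), schedule=None):
        KubernetesPodOperator(
            task_id=f"run-{project}",
            name=f"run-{project}",
            namespace="airflow",
            image=image,  # per-project image gives full dependency isolation
            cmds=["kedro"],
            arguments=["run"],
        )
```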

marrrcin avatar Jun 05 '24 11:06 marrrcin