Deploy multiple (Celery) workers with the Helm Chart
Description
Currently, the Helm Chart supports a single Celery worker deployment. The CeleryExecutor, however, supports multiple queues. Ideally, the Helm Chart should allow deploying multiple Celery worker deployments (different images, different resource allocations) so that queues can be used to send tasks to specialised workers (as already highlighted in the docs).
I am unsure whether this is only a matter of modifying the chart, or whether Airflow really expects a single worker deployment to be there; but in that case I don't see how queues could be used? Thanks a lot in advance.
Use case/motivation
It is often the case that tasks might require different hardware resources/different environments. For instance, if I have task A, which requires 1GB of memory, and task B, which requires 10GB, it's inefficient to run them both on the same Celery worker deployment.
Having multiple deployments would allow using queues to distribute tasks with finer granularity.
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
It would be great to add this feature to the Helm chart, but it won't be an easy task: to implement it we will need to add a loop to each Celery template and support two configuration sources, a new value of type list and the existing one for backward compatibility. We also need to update KEDA in the same way and update its queries so the different Celery clusters are scaled based on the tasks' queues.
We can also use the existing value as the default configuration for the different clusters, and override only the config provided in the list.
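For illustration, a rough sketch of what such a values layout could look like; every key name below is hypothetical and nothing like it exists in the chart today:

```yaml
# Hypothetical sketch only - neither the list key nor its structure exist in the chart.
workers:                      # existing value, used as the default configuration
  replicas: 2
  resources:
    requests:
      memory: 1Gi

celeryWorkerGroups:           # hypothetical new value of type list
  - name: heavy
    queue: heavy              # the KEDA query for this group would filter on this queue
    resources:
      requests:
        memory: 10Gi          # only the keys set here override the `workers` defaults
  - name: gpu
    queue: gpu
    image:
      repository: my-registry/airflow-gpu    # hypothetical specialised image
```

Each entry would then be rendered by looping over the list in the Celery templates, falling back to `workers` for anything it does not set.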
cc: @jedcunningham wdyt?
@RonaldGalea Would you like to work on it?
At one point we had an issue (or maybe a PR) that talked about the complexities of doing exactly this, but I can't find it right now.
I don't see (or recall) any blockers. It's just not a trivial thing to add. It would be great to get it added.
I haven't done templating before, but I'd be happy to give it a try. I'll see what I can learn this weekend.
I've had a look at how the helm chart is structured and my approach would be as follows:
First, identify the minimum options that need to be configurable for the feature to be useful. The ones I'm seeing are:
- worker image
- command & args
- replicas
- resources
- autoscaling (keda)
Since setting the image specifically for workers is not yet configurable, that could be a first standalone PR.
Next, I would leave the current `workers:` key exactly as it is (both for backward compatibility and because there will always be at least one worker type) and introduce an `additional-celery-workers:` key, where the aforementioned options can be specified and everything else falls back to the default worker type.
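To make that concrete, one entry under such a key might look roughly like this (the key name and structure are only a sketch of the minimum options above, not a final API):

```yaml
# Sketch only - proposed structure for one worker group, not existing chart values.
additional-celery-workers:
  - name: heavy
    image:
      repository: my-registry/airflow-heavy   # hypothetical specialised image
      tag: "2.6.0-heavy"
    # roughly mirror the default worker command, but bind this group to the `heavy` queue
    args: ["bash", "-c", "exec airflow celery worker --queues heavy"]
    replicas: 1
    resources:
      requests:
        memory: 10Gi
    keda:
      enabled: true
      minReplicaCount: 0
      maxReplicaCount: 5
```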
I believe most worker-related components can be exactly the same for all workers (at least for now), so they should not require changes:
- service account
- service
- network policy
- DB connection setup for KEDA (e.g. the PgBouncer network policy)
What does need to change is KEDA: we will have to create additional ScaledObjects to reflect the additional Deployments/StatefulSets. What I'm a bit unsure of is the query. Currently, it just lists everything from a table named `task_instance`; does this table contain queue-related info, or can it be easily added?
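For illustration, the per-queue variant I have in mind would look roughly like the sketch below, assuming `task_instance` does have a `queue` column we can filter on and assuming a hypothetical per-group `keda.query` override:

```yaml
# Hypothetical per-group KEDA block; only the shape of the per-queue query matters here.
# The divisor (16) stands in for this group's worker concurrency.
keda:
  enabled: true
  query: >-
    SELECT ceil(COUNT(*)::decimal / 16)
    FROM task_instance
    WHERE (state='running' OR state='queued') AND queue = 'heavy'
```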
Let me know if the overall approach sounds reasonable.
Greetings,
I am evaluating the features of the Airflow project and am still new to it, but it's good to see the old friend Celery, a project I have been familiar with since 2015.
After reading your words, maybe you can start with something simple, like playing with Celery and Django: it's easy to start two workers with docker-compose and have some fun moments. Since we already have DagRun, once the DAG files are synced/replicated to all workers, we should have a success.
Good luck! Airflow pools joining in will be a major milestone for this project. :)
I was wondering if there’s been any recent progress on this issue. This feature would be incredibly helpful for my use case. If things are currently delayed due to a busy schedule, would it be okay if I kindly offered to work on it and submit a pull request, perhaps based on @RonaldGalea ’s suggestions?
Sure, that's how it works here.