
feat: programmatically controlled dags

Open asosnovsky opened this issue 3 years ago • 5 comments

What does your PR do?

Adds support for programmatically pausing/unpausing DAGs.

Checklist

For all Pull Requests

asosnovsky avatar Sep 12 '22 17:09 asosnovsky

@asosnovsky this is an interesting idea, can you clarify the situations where someone would need/want this functionality?

Additionally, are you sure that every valid dag_id can be stored in a YAML map key? (It might make more sense to store the "controlled" dags in a YAML list of maps, with a key called dagID that stores a string value)

thesuperzapper avatar Sep 12 '22 23:09 thesuperzapper

So I'm managing a system where we have lots of deployments that require different subsets of dags. Additionally, sometimes engineers forget to turn on/off dags that they were debugging.

asosnovsky avatar Sep 13 '22 00:09 asosnovsky

@thesuperzapper do you mean something like this?

controlled:
   - dagID: myawesomedag
   - dagID: mycooldag2
     disabled: false
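
A minimal sketch of how a list-of-maps layout like this could be turned into a desired paused state, assuming the YAML has already been parsed into plain Python objects. The helper name `normalize_controlled`, the `dag_id` pattern, and the default of "enabled" when `disabled` is omitted are all assumptions for illustration; none of them are confirmed by the chart or by this thread.

```python
import re

# Assumed pattern for acceptable dagID values (letters, digits, "_", "-", ".").
# This is an illustrative guess, not taken from the chart.
DAG_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def normalize_controlled(entries):
    """Turn a parsed `controlled` list into a {dag_id: is_paused} mapping."""
    desired = {}
    for entry in entries:
        dag_id = entry["dagID"]
        if not isinstance(dag_id, str) or not DAG_ID_PATTERN.fullmatch(dag_id):
            raise ValueError(f"invalid dagID: {dag_id!r}")
        if dag_id in desired:
            raise ValueError(f"duplicate dagID: {dag_id!r}")
        # Assumption: an omitted `disabled` key means the DAG should be enabled.
        desired[dag_id] = bool(entry.get("disabled", False))
    return desired


# This structure mirrors the YAML example above after parsing.
controlled = [
    {"dagID": "myawesomedag"},
    {"dagID": "mycooldag2", "disabled": False},
]
desired_state = normalize_controlled(controlled)
```

Keeping the DAG id as a string value rather than a map key means any validation happens in one place, and duplicate entries can be rejected explicitly instead of being silently merged by the YAML parser.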

asosnovsky avatar Sep 13 '22 00:09 asosnovsky

> So I'm managing a system where we have lots of deployments that require different subsets of dags. Additionally, sometimes engineers forget to turn on/off dags that they were debugging.

@asosnovsky It really seems quite dangerous to use a control loop to force certain dags to be enabled/disabled. There are many situations when you may want to temporarily disable a dag to prevent/fix an issue, and with this, you would be fighting with the controller.
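
The conflict described above can be illustrated with a toy simulation, assuming a controller that re-applies the desired state on every tick. Everything here is hypothetical: the `reconcile` helper and the in-memory `state` dict only stand in for the real controller and the real DAG state.

```python
def reconcile(state, desired):
    """Force every controlled DAG back to its desired paused flag."""
    for dag_id, is_paused in desired.items():
        state[dag_id] = is_paused


desired = {"mycooldag2": False}  # the chart says: keep this DAG enabled
state = {"mycooldag2": False}    # in-memory stand-in for the real DAG state

# An operator pauses the DAG by hand to stop a misbehaving run.
state["mycooldag2"] = True
paused_after_manual_change = state["mycooldag2"]

# The next tick of the control loop undoes that manual pause.
reconcile(state, desired)
paused_after_next_tick = state["mycooldag2"]
```

After the manual change the DAG is paused, and after the next tick it is enabled again without the operator having done anything.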

To solve your stated problem of people accidentally leaving dags disabled/enabled, it makes more sense to suggest an upstream change in airflow itself to alert/highlight when an important dag has been left disabled for some period of time. This would be similar to the existing SLA feature, but it's not quite the same, as SLAs only apply to enabled dags.

You may want to suggest any ideas/proposals you have around this in the airflow thread that is discussing a rewrite of the SLA feature, or possibly raise your own separate "airflow improvement proposal" for a "monitored dags" feature like I described above.

But either way, I don't think I am comfortable adding this "dag controller" feature to the chart as I think it's unsafe (feel free to explain why my thinking is wrong).

thesuperzapper avatar Sep 14 '22 01:09 thesuperzapper

@thesuperzapper I don't see how this relates to SLA. It has nothing to do with notifying when a dag takes longer than expected. But I do agree that it can be annoying/dangerous when you try to pause a DAG and it just bounces back after a minute.

Perhaps if we dropped the "Deployment" and kept only the "Job" implementation, it would make a better compromise. That way, to return to "normal" you would have to reapply the chart. That's all I actually need anyway, i.e. a clean way to return to the expected original state for a given deployment. If I didn't want someone to modify the DAG states, I would just not give them that permission.

To strengthen the case for this: it's not uncommon (in my experience) for companies to require a special approval process to enable or permanently pause a DAG. Having a programmatic way to sync the desired state of all DAGs (by triggering the Job through helm), in a clean infrastructure-as-code fashion, is what we use Kubernetes for, and it's not something we should necessarily expect from Airflow. (The alternatives I've seen usually involve writing a bash or Python script that does the same thing, or writing a DAG that does the same thing, which is much less clean than using helm and tends to have worse audit logs.) Plus, Airflow does provide this capability by allowing users to modify the DAG states in the db (which is what my implementation utilizes).
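
A rough sketch of what a one-shot sync (the "Job" variant) could look like, assuming the current and desired paused flags are available as plain dicts. The names `plan_changes`, `sync_once`, and the `set_paused` callback are hypothetical, and how the flag is actually written (database or API) is deliberately left out; this is not the implementation from the PR.

```python
def plan_changes(current, desired):
    """Return (dag_id, old, new) tuples for DAGs whose paused flag must change."""
    changes = []
    for dag_id in sorted(desired):
        if dag_id not in current:
            # Assumption: DAGs unknown to Airflow are skipped, not created.
            continue
        if current[dag_id] != desired[dag_id]:
            changes.append((dag_id, current[dag_id], desired[dag_id]))
    return changes


def sync_once(current, desired, set_paused):
    """Apply the plan a single time and return it so it can be logged."""
    changes = plan_changes(current, desired)
    for dag_id, _old, new in changes:
        set_paused(dag_id, new)
    return changes


current = {"myawesomedag": True, "mycooldag2": False, "unmanaged": True}
desired = {"myawesomedag": False, "mycooldag2": False}

applied = []  # records what the hypothetical writer was asked to do
changes = sync_once(current, desired, lambda dag_id, paused: applied.append((dag_id, paused)))
```

Because the sync runs once per chart apply and returns the list of changes, manual pauses made afterwards are left alone until the next apply, and each run yields a record of what was changed. DAGs that are not listed (here `unmanaged`) are never touched.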

asosnovsky avatar Sep 14 '22 04:09 asosnovsky

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label

stale[bot] avatar Nov 19 '22 13:11 stale[bot]