Scheduler fails to schedule DagRuns due to persistent DAG record lock
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.7.3
What happened?
We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:
{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.
This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked.
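For anyone trying to picture the mechanism, here is a minimal sketch (not Airflow's actual code; the table, model, and connection string are made up) of how a `SELECT ... FOR UPDATE SKIP LOCKED` query silently drops a row that another open transaction has locked - which is, as far as I understand it, the general pattern behind a "record was locked, so scheduling was skipped" style message:

```python
# Illustrative only: two sessions contend for the same row.
from sqlalchemy import Column, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DagRecord(Base):  # stand-in for a DAG record, not Airflow's DagModel
    __tablename__ = "demo_dag"
    dag_id = Column(String, primary_key=True)

# Placeholder DSN -- point it at any throwaway PostgreSQL database.
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")
Base.metadata.create_all(engine)

with Session(engine) as setup:
    setup.merge(DagRecord(dag_id="dag-test"))
    setup.commit()

holder = Session(engine)
# Session 1 takes the row lock and keeps its transaction open.
holder.execute(
    select(DagRecord).where(DagRecord.dag_id == "dag-test").with_for_update()
).all()

with Session(engine) as scheduler:
    # Session 2 uses SKIP LOCKED, so the locked row simply disappears from its
    # result set instead of blocking -- analogous to "scheduling was skipped".
    rows = scheduler.execute(
        select(DagRecord).with_for_update(skip_locked=True)
    ).scalars().all()
    print([r.dag_id for r in rows])  # [] while session 1 holds the lock

holder.rollback()  # once the lock is released, the row becomes visible again
```

If some transaction keeps such a lock open for a long time, the skipped state would persist until that transaction ends, which would be consistent with the observation that restarting the scheduler pod makes the problem go away.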
What you think should happen instead?
The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.
How to reproduce
Run Airflow 2.7.3 on Kubernetes; HA is not required. Trigger multiple DagRuns (we have about 10 DAGs that run every minute). Observe the scheduler behavior and logs after a few successful runs; the error shows up after a few minutes.
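For reference, a minimal sketch of the kind of workload described above (DAG ids, task, and schedule are illustrative, not taken from our real DAGs):

```python
# Generates ~10 tiny DAGs that each run every minute, roughly mirroring
# the setup in which we see the problem. Purely illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

for i in range(10):
    with DAG(
        dag_id=f"lock_repro_{i}",
        schedule="* * * * *",  # every minute
        start_date=datetime(2024, 1, 1),
        catchup=False,
        max_active_runs=1,
    ) as dag:
        BashOperator(task_id="sleep", bash_command="sleep 30")
    globals()[dag.dag_id] = dag  # make each generated DAG visible to the DAG processor
```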
Operating System
centos7
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0
Deployment
Other
Deployment details
We have wrappers around the official airflow helm chart and docker images.
Environment:
Airflow Version: 2.7.3
Kubernetes Version: 1.24
Executor: KubernetesExecutor
Database: PostgreSQL (metadata database)
Environment/Infrastructure: Kubernetes cluster running Airflow in Docker containers
Anything else?
Actual Behavior: The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.
Workaround: Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.
Questions/Request for Information:
- Under what scenarios is the lock on a DAG record typically not released? (See the diagnostic sketch after this list.)
- Are there known issues in Airflow 2.7.3, or specific configurations, that might cause the DAG record to remain locked, thereby preventing new run scheduling?
- Could the changes made in PR #31414 be related to this issue?
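One way to start answering the first question on a live system is to look at the metadata database directly. A hedged diagnostic sketch (it assumes direct PostgreSQL access and psycopg2; the connection string is a placeholder): it lists backends that currently hold locks on the `dag` table - a `SELECT ... FOR UPDATE` shows up as a RowShareLock on the relation - so long-lived "idle in transaction" sessions stand out:

```python
# Hypothetical diagnostic script, not part of Airflow.
import psycopg2

QUERY = """
SELECT a.pid,
       l.mode,
       a.state,
       a.xact_start,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'dag'::regclass
  AND l.granted;
"""

# Placeholder connection string for the Airflow metadata database.
with psycopg2.connect("dbname=airflow user=airflow host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, mode, state, xact_start, query in cur.fetchall():
            # Sessions that stay "idle in transaction" while holding a lock on the
            # dag table are the usual suspects for a lock that never gets released.
            print(pid, mode, state, xact_start, query)
```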
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@doiken FYA
I think what could help is to show more complete logs around the time when the deadlock occurs, with the logging level set to debug - that would help anyone analysing the problem. Can you please upload such a log somewhere and link it here, @nookcreed?
@potiuk I have attached debug logs (scheduler.txt). It is harder to recreate the scenario with debug logging enabled; with info, it almost always happens within a few minutes.
@potiuk We are also facing a similar issue. I can help provide logs or any other details required.
Please. The more details, the more likely we will be able to find the cause
@potiuk I have the same issue. It started appearing after I upgraded Airflow from 2.5.1 to 2.7.3. All blocked DAGs start running again after I recreate the scheduler pods.
Then send more info - someone might want to take a look
I am trying to recreate or capture logs, but what I've been noticing thus far on 2.8.1:
- For DAGs showing "DAG record was locked", there is a task stuck in the running state that never times out (see the helper sketch after the logs below)
- ~~Happens only on Kubernetes spark-submit operator tasks~~
The only info-level logs I have captured so far (from the task instance logs), which are probably of little help. I will try to get some debug logs:
[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
[2024-02-02, 06:01:45 UTC] {local_task_job_runner.py:160} INFO - Task is not able to be run
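In case it helps others hitting the same symptom, here is a small helper sketch (it uses the Airflow ORM directly, so it has to run inside the Airflow environment; the one-hour threshold is an arbitrary choice) that lists task instances stuck in the running state:

```python
# Hedged helper: find task instances that have been "running" for over an hour.
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import TaskInstanceState


@provide_session
def find_stuck_running_tasks(older_than=timedelta(hours=1), session=None):
    cutoff = timezone.utcnow() - older_than
    return (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == TaskInstanceState.RUNNING,
            TaskInstance.start_date < cutoff,
        )
        .all()
    )


if __name__ == "__main__":
    for ti in find_stuck_running_tasks():
        print(ti.dag_id, ti.task_id, ti.run_id, ti.start_date)
```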
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
@potiuk can you please help remove the "pending-response" label?
We are also seeing the same issue. We started getting this after switching deployments to the official Helm chart on 2.7.3.
Sure. Removed. There is still not nearly enough evidence for anyone to be able to replicate, diagnose, and form a hypothesis about the issue, but I can remove the label. I am afraid that for anyone trying to diagnose it, the more evidence there is (rather than people saying they have the same issue), the bigger the chance someone will actually find and fix the issue.
I hit the same issue, but after I updated the Helm values file to use a separated DAG processor, I no longer see this error in the scheduler log.
Just to add some information on our experience of this issue.
We deploy Airflow with Astronomer. We started seeing something like this issue in version 9.4.0, which includes Airflow 2.7.2.
When I say 'this issue' I mean the following:
- DAGs start going into a skipped state and are not run.
- Initially this is sporadic and for a while it's possible to trigger DAGs manually.
- Eventually things degrade to a point where all DAGs skip and you can no longer get them to run by manually triggering them.
- We see logs like the following:
{scheduler_job_runner.py:1444} INFO - DAG our_dag scheduling was skipped, probably because the DAG record was locked.
- We see some other suspicious logs, but searching for those individually seems to indicate they are not part of the issue.
- We are running everything in Kubernetes, version 1.26.6.
- We have a single scheduler instance running in the namespace.
These are the provider versions we have installed and we're using Python 3.11:
apache-airflow-providers-cncf-kubernetes==7.14.0
apache-airflow-providers-databricks==4.6.0
apache-airflow-providers-microsoft-azure==8.0.0
apache-airflow-providers-microsoft-mssql==3.5.0
apache-airflow-providers-mongo==3.3.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-snowflake==5.3.0
We went over our DAGs in detail to see if we had introduced something terrible, and we tried turning some DAGs off so the scheduler was only running a couple of DAGs.
We tried upgrading to Airflow 2.8.0. That initially seemed to fix the issue but we eventually learned that any time we restart the scheduler, the issue resolves temporarily but always comes back. We also tried upgrading to 2.8.1.
We completely removed the Airflow deployment, recreating it from scratch, but the issue came back.
Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue. We still have no idea what the actual cause of the issue was, only that it does not happen in 2.7.0 but does in all versions from 2.7.2 and above.
Can you raise it with Astronomer's support? I believe they provide a paid service including support, and since they have direct insight into what you see, they can investigate things much more thoroughly. Here people help when they have free time, and maybe they will find the problem, maybe not; with paid support it's quite a bit more likely you can expect investigation and diagnosis, especially since they can have a peek (with your permission) at your installation (and since they are providing the service, it's pretty much on them to make sure it's configured properly).
Did you try it @ruarfff ? What do they say?
@potiuk thank you. We did try Astronomer support but no luck there :) They couldn't figure it out.
I just wanted to add some extra information to this issue in case it might help but I am not expecting help solving the issue here. For now, we know we can stick to version 2.7.0 and things will work.
I'd suggest trying again. If they could not figure it out with access to the system, then I am afraid it's not going to be any easier here, as people here cannot do any diagnosis on your system, and everyone here is trying to help in their free time, when they feel like it, so there is little chance someone will easily reproduce it just by following the description. There, at least, you have an easily reproducible environment that Astronomer has full control over - an ideal situation to run a diagnosis.
I think you should insist there. They have a lot of expertise, and if they get a strong signal that people don't upgrade because of this issue - AND the fact that it happens in Astronomer's controlled environment AND is easily reproducible there - that makes it far more feasible to diagnose the problem. I might ping a few people at Astronomer to take a closer look if you help with the reproducibility case there, @ruarfff.
@ruarfff Can you let me know your Astro support ticket number and we'll dig in to this a bit deeper. Edit: found the ticket now.
Okay, saying "They couldn't figure it out." isn't really fair, given that two things were going on in parallel on that support ticket, and the original problem was solved and then the ticket closed by a colleague of yours :)
Anyway, I've got our Airflow dev team looking at this now.
@ashb sorry, my bad. From my perspective it wasn't figured out but you're right it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.
Yeah I get that! S'alright. We might be in touch if we need help reproducing this.
Would it make sense to add a flag to not lock the dag_model? The lock was introduced in this PR https://github.com/apache/airflow/pull/31414, which seems to be causing the problem.
> Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue.

We just tried out 2.7.0 and in our case it didn't help. 2.6.2 is, for now, the last version without that message in our case.
I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2
> I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2
Agree
> I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2
Agree
Done https://github.com/apache/airflow/pull/37596
Closing as the problematic PR has been reverted
We've had these issues lately and upgraded to 2.8.2 yesterday, but that didn't solve the issue.
The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.
> We've had these issues lately and upgraded to 2.8.2 yesterday, but that didn't solve the issue. The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.
I think without additional details - logs and a description of what exactly happens - it's hard to do anything. You must have some special circumstances that trigger it, but without knowing them it's hard to do anything to help you.
It does look like something external is locking your DB - maybe you have some special DB (not a real Postgres), or maybe your Postgres is configured in an unusual way? I suggest opening a new issue, describing it in detail, and trying to list as many specifics as you can.