Scheduler fails to schedule DagRuns due to persistent DAG record lock

Open nookcreed opened this issue 1 year ago • 60 comments

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.3

What happened?

We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:

{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.

This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked.
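
For readers trying to make sense of that log line: the scheduler selects DAG records with SELECT ... FOR UPDATE SKIP LOCKED, so a DAG whose row is still held by another open transaction never comes back from the query and is skipped for that scheduling loop. Below is a minimal sketch of that pattern (illustrative only, not the actual scheduler code; the connection string and trimmed-down table definition are assumptions):

```python
# Simplified illustration of the SELECT ... FOR UPDATE SKIP LOCKED pattern.
# Rows locked by another open transaction are silently skipped, which is why
# a "stuck" lock makes a DAG disappear from the scheduler's candidate set.
from sqlalchemy import Column, MetaData, String, Table, create_engine, select

engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")  # assumption
metadata = MetaData()
dag = Table("dag", metadata, Column("dag_id", String, primary_key=True))

with engine.begin() as conn:
    rows = conn.execute(
        select(dag.c.dag_id).with_for_update(skip_locked=True)
    ).fetchall()
    # Any dag_id missing from `rows` is what leads to the
    # "scheduling was skipped, probably because the DAG record was locked" log.
    print([r.dag_id for r in rows])
```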

What you think should happen instead?

The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.

How to reproduce

Run Airflow v2.7.3 on Kubernetes (HA is not required). Trigger multiple DagRuns; we have about 10 DAGs that run every minute. Observe scheduler behavior and logs after a few successful runs. The error shows up after a few minutes.
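
For completeness, a minimal sketch of the kind of DAG used to reproduce this; the dag_id matches the log above, but the task body and other parameters are illustrative assumptions, not the reporter's production code:

```python
# One of ~10 near-identical DAGs, each scheduled every minute (illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dag-test",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",  # run every minute
    catchup=False,
    max_active_runs=1,
) as dag:
    EmptyOperator(task_id="noop")
```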

Operating System

centos7

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0

Deployment

Other

Deployment details

We have wrappers around the official Airflow Helm chart and Docker images.

Environment:

Airflow Version: 2.7.3
Kubernetes Version: 1.24
Executor: KubernetesExecutor
Database: PostgreSQL (metadata database)
Environment/Infrastructure: Kubernetes cluster running Airflow in Docker containers

Anything else?

Actual Behavior: The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.

Workaround: Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.
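
For anyone scripting the restart workaround, here is a sketch using the Kubernetes Python client; the namespace and label selector are assumptions based on the official Helm chart, so adjust them for your deployment:

```python
# Hypothetical helper: delete the scheduler pod(s) so the Deployment recreates
# them, which releases whatever stale locks their DB sessions were holding.
from kubernetes import client, config

NAMESPACE = "airflow"             # assumption: the namespace Airflow runs in
SELECTOR = "component=scheduler"  # assumption: official chart's scheduler label

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

for pod in core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items:
    print(f"deleting {pod.metadata.name}")
    core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```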

Questions/Request for Information:

  1. Under what scenarios is the lock on a DAG record typically not released?
  2. Are there known issues in Airflow 2.7.3, or specific configurations, that might cause the DAG record to remain locked, thereby preventing new run scheduling?
  3. Could the changes made in PR #31414 be related to this issue?

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

nookcreed avatar Jan 19 '24 21:01 nookcreed

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar Jan 19 '24 21:01 boring-cyborg[bot]

@doiken FYA

nookcreed avatar Jan 19 '24 21:01 nookcreed

I think what could help is showing more complete logs from around the time when the deadlock occurs, with the logging level set to debug - that would help anyone analysing the problem. Can you please upload such a log somewhere and link it here @nookcreed?

potiuk avatar Jan 20 '24 16:01 potiuk

scheduler.txt @potiuk I have attached debug logs. It is harder to recreate the scenario with debug logging enabled. With info-level logging, it almost always happens within a few minutes.

nookcreed avatar Jan 20 '24 18:01 nookcreed

@potiuk We are also facing a similar issue. I can help provide logs or any other details required.

arvind1705 avatar Jan 31 '24 08:01 arvind1705

Please. The more details, the more likely we will be able to find the cause

potiuk avatar Jan 31 '24 09:01 potiuk

@potiuk I have the same issue. It started appearing after I upgraded Airflow from 2.5.1 to 2.7.3. All blocked DAGs start running again after I recreate the scheduler pods.

selimchergui avatar Jan 31 '24 14:01 selimchergui

Then send more info - someone might want to take a look

potiuk avatar Jan 31 '24 14:01 potiuk

I am trying to recreate or capture logs, but what I've been noticing thus far on 2.8.1:

  • For DAGs showing that "DAG record was locked", there is a task that is stuck in a running state, and never times out ~~- Happens only on Kubernetes spark submit operators tasks~~

The only info-level logs I have captured so far are from the task instance logs, which are probably no help. I will try to get some debug logs:

[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
[2024-02-02, 06:01:45 UTC] {local_task_job_runner.py:160} INFO - Task is not able to be run
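
A diagnostic sketch for spotting such stuck task instances via the metadata DB (not an official tool; the 6-hour cutoff is an arbitrary assumption):

```python
# List task instances that have been sitting in "running" for a long time.
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

cutoff = timezone.utcnow() - timedelta(hours=6)  # arbitrary threshold

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == TaskInstanceState.RUNNING,
            TaskInstance.start_date < cutoff,
        )
        .all()
    )
    for ti in stuck:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.start_date)
```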

plmnguyen avatar Feb 02 '24 16:02 plmnguyen

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] avatar Feb 20 '24 00:02 github-actions[bot]

@potiuk can you please help remove the "pending-response" label?

nookcreed avatar Feb 20 '24 01:02 nookcreed

We are also seeing the same issue. We started getting this after switching our deployments to the official Helm chart on 2.7.3.

rahulnbhandari avatar Feb 20 '24 01:02 rahulnbhandari

Sure. Removed. There is still not nearly enough evidence for anyone to be able to replicate, diagnose, and form a hypothesis about the issue, but I can remove the label. I am afraid, though, that for anyone trying to diagnose it, the more evidence there is (rather than people saying they have the same issue), the bigger the chance someone will actually find and fix the issue.

potiuk avatar Feb 20 '24 05:02 potiuk

I hit the same issue. But after I updated the Helm values file to use a separate DAG processor, I no longer see this error in the scheduler log.

mis98zb avatar Feb 20 '24 09:02 mis98zb

Just to add some information on our experience of this issue.

We deploy Airflow with Astronomer. We started seeing something like this issue in version 9.4.0, which includes Airflow 2.7.2.

When I say 'this issue' I mean the following:

  • DAGs start going into a skipped state and are not run.
  • Initially this is sporadic and for a while it's possible to trigger DAGs manually.
  • Eventually things degrade to a point where all DAGs skip and you can no longer get them to run by manually triggering them.
  • We see logs like the following: {scheduler_job_runner.py:1444} INFO - DAG our_dag scheduling was skipped, probably because the DAG record was locked.
  • We see some other suspicious logs but searching for those individually seems to indicate they are not part of the issue.
  • We are running everything in Kubernetes, version 1.26.6.
  • We have a single scheduler instance running in the namespace.

These are the provider versions we have installed and we're using Python 3.11:

apache-airflow-providers-cncf-kubernetes==7.14.0
apache-airflow-providers-databricks==4.6.0
apache-airflow-providers-microsoft-azure==8.0.0
apache-airflow-providers-microsoft-mssql==3.5.0
apache-airflow-providers-mongo==3.3.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-snowflake==5.3.0

We went over our DAGs in detail to see if we had introduced something terrible, and we tried turning some DAGs off so that the scheduler was only running a couple of DAGs.

We tried upgrading to Airflow 2.8.0. That initially seemed to fix the issue but we eventually learned that any time we restart the scheduler, the issue resolves temporarily but always comes back. We also tried upgrading to 2.8.1.

We completely removed the Airflow deployment, recreating it from scratch, but the issue came back.

Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue. We still have no idea what the actual cause of the issue was, only that it does not happen in 2.7.0 but does in all versions from 2.7.2 and above.

ruarfff avatar Feb 20 '24 09:02 ruarfff

Can you raise it with Astronomer's support? I believe they provide a paid service including support, and since they have direct insight into what you see, they can investigate things much more thoroughly. Here people help when they have free time, and maybe they will find the problem, maybe not - but with paid support it's quite a bit more likely you can expect investigation and diagnosis, especially since they can have a peek (with your permission) at your installation (and since they are providing the service, it's pretty much on them to make sure it's configured properly).

Did you try it @ruarfff? What did they say?

potiuk avatar Feb 20 '24 11:02 potiuk

@potiuk thank you. We did try Astronomer support but no luck there :) They couldn't figure it out.

I just wanted to add some extra information to this issue in case it might help but I am not expecting help solving the issue here. For now, we know we can stick to version 2.7.0 and things will work.

ruarfff avatar Feb 20 '24 13:02 ruarfff

I'd suggest trying again. If they could not figure it out with access to the system, then I am afraid it's not going to be any easier here, as people here cannot do any more diagnosis on your system, and everyone here is trying to help in their free time, when they feel like it, so there is little chance someone will easily reproduce it just by following the description. There, at least, you have an easily reproducible environment that Astronomer has full control over - an ideal situation for running a diagnosis.

I think you should insist there. They have a lot of expertise, and if they get a strong signal that people are not upgrading because of this issue - and it happens in Astronomer's controlled environment and is easily reproducible there - that makes it far more feasible to diagnose the problem. I might ping a few people at Astronomer to take a closer look if you will help with the reproducibility case there @ruarfff

potiuk avatar Feb 20 '24 13:02 potiuk

@ruarfff Can you let me know your Astro support ticket number and we'll dig in to this a bit deeper. Edit: found the ticket now.

Okay, saying "They couldn't figure it out" isn't really fair, given that two things were going on in parallel on that support ticket, and the original problem was solved and the ticket then closed by a colleague of yours :)

Anyway, I've got our Airflow dev team looking at this now.

ashb avatar Feb 20 '24 13:02 ashb

@ashb sorry, my bad. From my perspective it wasn't figured out but you're right it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.

ruarfff avatar Feb 20 '24 14:02 ruarfff

Yeah I get that! S'alright. We might be in touch if we need help reproducing this.

ashb avatar Feb 20 '24 14:02 ashb

Would it make sense to add a flag to not lock the dag_model? The locking was introduced in PR https://github.com/apache/airflow/pull/31414, which seems to be causing the problem.
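
To make the suggestion concrete, a purely hypothetical sketch of what such an opt-out could look like; the `use_dag_record_row_locks` option and the surrounding query are invented for illustration and do not exist in Airflow:

```python
# Hypothetical only: a config flag guarding the row locking added in #31414.
# Neither the option name nor this exact query exists in Airflow today.
from airflow.configuration import conf
from airflow.models.dag import DagModel
from airflow.utils.sqlalchemy import with_row_locks


def dags_to_schedule(session, dag_ids):
    query = session.query(DagModel).filter(DagModel.dag_id.in_(dag_ids))
    if conf.getboolean("scheduler", "use_dag_record_row_locks", fallback=True):
        # Current behaviour: skip DAGs whose record is locked by another session.
        query = with_row_locks(query, of=DagModel, session=session, skip_locked=True)
    return query.all()
```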

rahulnbhandari avatar Feb 20 '24 14:02 rahulnbhandari

Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue.

We just tried out 2.7.0 and in our case it didn't help. 2.6.2 is, for now, the version without that message in our case.

gr8web avatar Feb 21 '24 09:02 gr8web

I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2

ephraimbuddy avatar Feb 21 '24 14:02 ephraimbuddy

I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2

Agree

potiuk avatar Feb 21 '24 14:02 potiuk

I have spent some time trying to reproduce this but I haven't been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it's not a very serious one. cc @potiuk . If that's Ok, I will revert and include the change in 2.8.2

Agree

Done https://github.com/apache/airflow/pull/37596

ephraimbuddy avatar Feb 21 '24 16:02 ephraimbuddy

Closing as the problematic PR has been reverted

ephraimbuddy avatar Feb 21 '24 17:02 ephraimbuddy

We've had these issues lately and upgraded to 2.8.2 yesterday, but that didn't solve the issue.

The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.

ssudrich-soundhound avatar Feb 29 '24 08:02 ssudrich-soundhound

We've had these issues lately and upgraded to 2.8.2 yesterday, but that didn't solve the issue.

The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.

I think that without additional details - logs and what exactly happens - it's hard to do anything. You must have some special circumstances that trigger it, but without knowing them it's hard to help you.

potiuk avatar Feb 29 '24 08:02 potiuk

It does look like something external is locking your DB - maybe you have a special database (not real Postgres), or maybe your Postgres setup is special? I suggest opening a new issue, describing it in detail, and trying to pin down as many specific things as you can.
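
For anyone trying to narrow that down, a diagnostic sketch that asks Postgres which backends hold locks touching the dag table of the metadata database; the connection string is an assumption, and long idle-in-transaction sessions are the usual suspects:

```python
# Show which Postgres backends hold locks on the "dag" table and what they ran.
import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")  # assumption
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT a.pid, a.state, a.xact_start, a.query
        FROM pg_locks l
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE l.relation = 'dag'::regclass
        ORDER BY a.xact_start
        """
    )
    for pid, state, xact_start, query in cur.fetchall():
        print(pid, state, xact_start, (query or "")[:120])
```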

potiuk avatar Feb 29 '24 08:02 potiuk