airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Fix backfill queued task getting reset to scheduled state.

Open snjypl opened this issue 3 years ago • 6 comments

the backfill queued tasks were getting reset to scheduled state by kubernetes_executer.clear_not_launched_queued_tasks of scheduler instance.

this fixes the issue by making clear_not_launched_queued_tasks only handle the tasks which are created by its own scheduler/job.

closes: #23356 , closes: #23693 , closes: #23145

snjypl avatar May 16 '22 10:05 snjypl

@ashb @ephraimbuddy @dstandish @jedcunningham -> not sure if that's the right fix, But I guess this addresses the issues for 2.3.1

potiuk avatar May 18 '22 07:05 potiuk

Having the same issue in airflow==2.2.5. Applying a slightly adapted patch from this pull request fixes the problem with broken backfill.

ei-grad avatar May 19 '22 17:05 ei-grad

Hi team! New to the scene here, but we are also facing this bug as well in our implementation (2.3.3, KubernetesExecutor). Wondering if this PR could get a second pair of eyes. Similar to what is mentioned in the discussion, we'll have backfills that run (some to completion) but then get stuck because they are later marked scheduled.

All tasks getting marked as scheduled after already completing their backfill: image

A dag run that completes one task but has its subsequent task marked as scheduled: image

vchiapaikeo avatar Aug 03 '22 21:08 vchiapaikeo

I don't think we should change how Kubernetes executor handles queued task instances that have overstayed in queued state. If a task stayed in queued state for a long time, then something is wrong and the task should be cleared since we have a configuration to detect and clear it. While https://github.com/apache/airflow/pull/26205 can also solve this issue, I think it's not necessary to send a task that's not in scheduled state to the executor. This is missing right now in backfill because the possibility to have many tasks set to scheduled was solved in https://github.com/apache/airflow/pull/26161.

ephraimbuddy avatar Sep 08 '22 11:09 ephraimbuddy

I don't think we should change how Kubernetes executor handles queued task instances that have overstayed in queued state. If a task stayed in queued state for a long time, then something is wrong and the task should be cleared since we have a configuration to detect and clear it. While #26205 can also solve this issue, I think it's not necessary to send a task that's not in scheduled state to the executor. This is missing right now in backfill because the possibility to have many tasks set to scheduled was solved in #26161.

After consultations, I'm withdrawing the above statement.

ephraimbuddy avatar Sep 12 '22 19:09 ephraimbuddy

Can you add tests to ensure this is not altered in the future?

ephraimbuddy avatar Sep 16 '22 11:09 ephraimbuddy

Hi all. I can confirm I still see this issue on 2.4.0 KubernetesExecutor. Would this be available in 2.4.2? I see the milestone has being pushed several times. We are extremely interested in this fix

agomez-etsy avatar Sep 28 '22 15:09 agomez-etsy

Hi all. I can confirm I still see this issue on 2.4.0 KubernetesExecutor. Would this be available in 2.4.2? I see the milestone has being pushed several times. We are extremely interested in this fix

We need tests on this one

ephraimbuddy avatar Oct 03 '22 20:10 ephraimbuddy

For what it's worth, we just updated to 2.4.1 yesterday (w/ KubernetesExecutor) and we are unable to repro the previous buggy behavior where the scheduler would change the status of backfill runs. @ephraimbuddy , I think your PR may have mitigated or resolved this issue -https://github.com/apache/airflow/pull/26342/files.

By fixed, I mean that we can now start a backfill with the airflow dags backfill command and have it run to completion. It no longer seems to be getting interrupted mid-way by a scheduler process that is resetting its state.

image

vchiapaikeo avatar Oct 04 '22 12:10 vchiapaikeo

Argh, I may have spoken too soon. Another stuck one 😢

image

vchiapaikeo avatar Oct 04 '22 12:10 vchiapaikeo

One more note - we see these log lines directly before the backfill stalls. Which I think is in line with what this PR will address since @snjypl is modifying the function clear_not_launched_queued_tasks.

[2022-10-04 18:37:06,838] {scheduler_job.py:1374} INFO - Resetting orphaned tasks for active dag runs
[2022-10-04 18:37:08,918] {kubernetes_executor.py:470} INFO - Found 2 queued task instances
[2022-10-04 18:37:08,930] {kubernetes_executor.py:511} INFO - TaskInstance: <TaskInstance: test_spark_yaml_1664908327.dataproc_cluster.check_failed_jobs backfill__2022-09-06T00:00:00+00:00 [queued]> found in queued state but was not launched, rescheduling
[2022-10-04 18:37:08,944] {kubernetes_executor.py:511} INFO - TaskInstance: <TaskInstance: test_spark_yaml_1664908327.dataproc_cluster2.check_failed_jobs backfill__2022-09-06T00:00:00+00:00 [queued]> found in queued state but was not launched, rescheduling

vchiapaikeo avatar Oct 04 '22 18:10 vchiapaikeo

Can you add tests to ensure this is not altered in the future?

@ephraimbuddy , i have added some tests, can you please review it and let me know if i needto change/update anything. thank you for help !

snjypl avatar Oct 04 '22 20:10 snjypl

Needs conflict resolving.

potiuk avatar Oct 25 '22 19:10 potiuk

Some strange errors to fix.

potiuk avatar Oct 31 '22 04:10 potiuk

Looks like the only meaningful error is:

Run black (python formatter) on other..................................................Failed
- hook id: black
- duration: 67.5s
- files were modified by this hook

reformatted tests/executors/test_kubernetes_executor.py

ei-grad avatar Nov 01 '22 11:11 ei-grad

Yep. You need to fix it - I recommend installing pre-commit (as per our contribution docs). It will fix it for you.

potiuk avatar Nov 01 '22 12:11 potiuk

Make sure to rebase BTW.

potiuk avatar Nov 01 '22 12:11 potiuk

Yep. You need to fix it - I recommend installing pre-commit (as per our contribution docs). It will fix it for you.

@potiuk i have fixed the issue with black formatter by installing pre commit hook. there is still a failing task. Tests / Sqlite Py3.7: API Always CLI Core Integration Other Providers WWW

does it need to be addressed?

snjypl avatar Nov 09 '22 14:11 snjypl

Is there anything I can do to help move this one along? We rely on the airflow backfill CLI and this bug really makes it difficult for our users to backfill anything.

Anyone have any workarounds that I can implement now without waiting for this to be merged in?

DMilmont avatar Nov 11 '22 16:11 DMilmont

Is there anything I can do to help move this one along? We rely on the airflow backfill CLI and this bug really makes it difficult for our users to backfill anything.

You can patch your airflow with those changes and verify that they work in your case. That would be an absolutely awesome input and one that can speed up the merge process if we have some evidences that people are eager to get it, even to the extend that they decide to patch their airflow without waiting for the next release.

The change looks like easy to patch manually.

potiuk avatar Nov 11 '22 16:11 potiuk

I went ahead and manually patched this proposed fix into our airflow environment.

We average around 30 backfills per day, and over the last 4 days since it was manually patched - I have not seen or heard of this issue popping up. So far it seems that the fix has resolved the issue.

If it pops up again before this is merged I will report back. Just wanted to update with some test results.

DMilmont avatar Nov 17 '22 17:11 DMilmont

If it pops up again before this is merged I will report back. Just wanted to update with some test results.

Cool. Thanks for that !

potiuk avatar Nov 18 '22 01:11 potiuk

Thanks @snjypl

potiuk avatar Nov 18 '22 01:11 potiuk