airflow icon indicating copy to clipboard operation
airflow copied to clipboard

The grid view shows incorrect DAG runs

Open zachliu opened this issue 3 years ago • 18 comments

Apache Airflow version

2.4.0

What happened

Airflow 2.4.0's grid view has an issue:

Grid view stop showing the latest DAG runs in 2.4.0 (regardless of schedule, it shows old dag runs)

  • case 1 (all 25 are old dag runs) 2022-09-19_16-12

  • case 2 (old and new dag runs are mixed) 2022-09-20_17-59 2022-09-20_17-50

What you think should happen instead

No response

How to reproduce

  1. Go to a DAG with >365 dag runs
  2. The place where it's supposed to show the latest dag run in the "grid view" now shows an old dag run ~1 year ago

Operating System

Linux Mint 20.3 Una

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

zachliu avatar Sep 19 '22 20:09 zachliu

most likely due to https://github.com/apache/airflow/pull/25633 and https://github.com/apache/airflow/pull/25880 @uranusjr @bbovenzi any idea how to solve this? i'm not familiar with the "timetable" part :sweat_smile:

zachliu avatar Sep 20 '22 22:09 zachliu

Hi there! Follow up from dev ML.

We faced exactly the same issue after upgrade from 2.3.4 to 2.4.0 for all our dags. All of them has quite a history up to 2016 year, so a lot of dag runs. Somehow grid show for us latest runs at 2021.11 while other views (calendar, graph) shows all of them and the most recent ones.

The exact fix which solved this problem is:

--- airflow/www/views.py
+++ airflow/www/views.py
@@ -3453,8 +3460,9 @@ class Airflow(AirflowBaseView):
             if run_state:
                 query = query.filter(DagRun.state == run_state)
 
-            ordering = (DagRun.__table__.columns[name].desc() for name in dag.timetable.run_ordering)
-            dag_runs = query.order_by(*ordering, DagRun.id.desc()).limit(num_runs).all()
+            dag_runs = query.order_by(DagRun.execution_date.desc()).limit(num_runs).all()
             dag_runs.reverse()
 
             encoded_runs = [wwwutils.encode_dag_run(dr) for dr in dag_runs]

It feels like there is a problem between data and logic that operates it, but not sure where and what is it.

kxepal avatar Sep 21 '22 10:09 kxepal

@uranusjr Can you take a look at this ASAP please so we can get a fix for this out in 2.4.1?

ashb avatar Sep 21 '22 11:09 ashb

@kxepal What is the timetable/schedule parameter for the DAG(s) with this problem?

ashb avatar Sep 21 '22 12:09 ashb

@ashb we have good old timedelta schedule interval value everywhere. No timetables at all.

kxepal avatar Sep 21 '22 12:09 kxepal

@kxepal doesn't the fix actually revert the core of https://github.com/apache/airflow/pull/25633? i also did that because i "accidentally" deployed airflow 2.4.0 to our production :shushing_face:

zachliu avatar Sep 21 '22 14:09 zachliu

@zachliu That PR contains a lot of changes, but for us was enough to revert those changes that in the diff. The rest of them causes no troubles.

kxepal avatar Sep 21 '22 14:09 kxepal

@kxepal taking a wild guess here, maybe a if can solve all: if timetables are used, do that; else, do this

zachliu avatar Sep 21 '22 14:09 zachliu

@zachliu probably. We don't use timetables, so it was clear fix for us.

kxepal avatar Sep 21 '22 14:09 kxepal

@kxepal same here, i'm waiting for the distant future "calendar"-like scheduling system :wink:

zachliu avatar Sep 21 '22 14:09 zachliu

Everything is converted to a timetable now under the hood (and had been since timetables were introduced) so it's not as simple as an if.

ashb avatar Sep 21 '22 15:09 ashb

aww... wrong guess :crying_cat_face: it was based on the fact that some old dag runs don't have the Data interval start/end parameters (see case 2 screenshots)

zachliu avatar Sep 21 '22 15:09 zachliu

@ashb What information could be useful to debug this issue?

kxepal avatar Sep 21 '22 20:09 kxepal

@kxepal Creating a minimal stand-alone reproduction case if you can ("here's a dag, run it/backfill it for X" sort of thing)

ashb avatar Sep 21 '22 21:09 ashb

@kxepal Creating a minimal stand-alone reproduction case if you can ("here's a dag, run it/backfill it for X" sort of thing)

i tried that, running the same dag many times in a short period (backfilling) doesn't help, they will be ordered properly. i guess because they are all converted to timetables uniformly

it seems the issue only occurs on dags that are older than the introduction of timetable, ~which suggests that during the conversion, older dag runs lack attribute(s) for a proper sorting~ as i mentioned before, for older dag runs whose data_interval_end is NULL, the order_by(*ordering, DagRun.id.desc()) doesn't work

it's basically

ordering = (DagRun.__table__.columns[name].desc() for name in dag.timetable.run_ordering)
dag_runs = query.order_by(*ordering, DagRun.id.desc()).limit(num_runs).all()
SELECT dag_id,
       run_id,
       data_interval_end
FROM dag_run
WHERE dag_id = '<dag_name>'
ORDER BY data_interval_end DESC,
         execution_date DESC
LIMIT 25;

vs

dag_runs = query.order_by(DagRun.execution_date.desc()).limit(num_runs).all()
SELECT dag_id,
       run_id,
       data_interval_end
FROM dag_run
WHERE dag_id = '<dag_name>'
ORDER BY execution_date DESC
LIMIT 25;

zachliu avatar Sep 21 '22 21:09 zachliu

Creating a minimal stand-alone reproduction case if you can ("here's a dag, run it/backfill it for X" sort of thing)

Oblivious, but it be quite a hard to provide. @zachliu feels like walk on this path with no luck. But it looks like the problem could be solved by a simple new migration script. But what the cases it should handle and how it should fix them?

kxepal avatar Sep 21 '22 21:09 kxepal

can the migration script simply fills all the NULL data_interval_end with execution_date?

zachliu avatar Sep 21 '22 22:09 zachliu

But why it already doesn't? It doesn't feels like a problem with current logic - it works fine, but somewhere in changes between two versions. Actually, running new dagruns with 2.4 without a fix doesn't change anything.

kxepal avatar Sep 21 '22 22:09 kxepal

Filling all data interval with execution date would be excruciatingly slow, which is why it’s not done. Doing some fancy coalesce for ordering is probably possible.

uranusjr avatar Sep 23 '22 08:09 uranusjr

I created a PR for this. Would be awesome if someone could test it!

uranusjr avatar Sep 23 '22 09:09 uranusjr

Fixed by #26626

ashb avatar Sep 26 '22 12:09 ashb