airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Add indexes on dag_id column in refencing tables to speed up deletion of dag records

Open pankajkoti opened this issue 9 months ago • 2 comments

When the dag records count gets huge, and users try to delete DAG and DAG runs that are no longer needed or are stale, it is observed that the deletion is significantly slow. The reason for this is that the CASCADING DELETES are slow. Although, we have foreign key constraints in the referencing tables, they do not create an index implicitly on those columns (dag_id in the referencing tables in this case). Hence, we're creating indexes on the 6 referencing table for CASCADE DELETES to speed up the deletion of records. Without these indexes, it was observed that it takes many hours to delete those records and it reduced to a few seconds after adding those indexes.


^ Add meaningful description above Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

pankajkoti avatar May 09 '24 15:05 pankajkoti

Seems like for MySQL indexes for FK already exists

Taragolis avatar May 09 '24 15:05 Taragolis

Seems like for MySQL indexes for FK already exists

Wondering if I should put these operations under

if conn.dialect.name != "mysql":
    create_index....

or

if conn.dialect.name in  ("sqlite", "postgresql"):
    create_index....

WDYT is better?

pankajkoti avatar May 09 '24 17:05 pankajkoti

Closing in favour of https://github.com/apache/airflow/pull/39638

pankajkoti avatar May 15 '24 19:05 pankajkoti