airflow icon indicating copy to clipboard operation
airflow copied to clipboard

airflow db clean is unable to delete the table rendered_task_instance_fields

Open sai3563 opened this issue 3 years ago • 3 comments

Apache Airflow version

Other Airflow 2 version

What happened

Hi All,

When I run the below command in Airflow 2.3.4:

airflow db clean --clean-before-timestamp '2022-09-18T00:00:00+05:30' --yes

I receive an error within a warning which says

[2022-09-20 10:33:30,971] {db_cleanup.py:302} WARNING - Encountered error when attempting to clean table 'rendered_task_instance_fields'.

All other tables like log, dag, xcom get deleted properly. On my analysis, rendered_task_instance_fields was the 5th largest table by rows in the DB, so the impact of it's data size is significant.

On analyzing the table itself on PostGres 13 DB, I found that the table rendered_task_instance_fields has no timestamp column that records when the entry was inserted.

https://imgur.com/a/Qys2uwD

Thus, there would be no way the code can filter out older records and delete them.

What you think should happen instead

A timestamp field needs to be added to the table rendered_task_instance_fields basis of which older records can be deleted.

How to reproduce

Run the below command in airflow v2.3.4 and check the output.

airflow db clean --clean-before-timestamp '2022-09-18T00:00:00+05:30' --yes

Operating System

Ubuntu 20.04 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==5.1.0 apache-airflow-providers-celery==3.0.0 apache-airflow-providers-common-sql==1.2.0 apache-airflow-providers-ftp==3.1.0 apache-airflow-providers-google==8.3.0 apache-airflow-providers-http==4.0.0 apache-airflow-providers-imap==3.0.0 apache-airflow-providers-mongo==3.0.0 apache-airflow-providers-mysql==3.2.0 apache-airflow-providers-slack==5.1.0 apache-airflow-providers-sqlite==3.2.1

Deployment

Virtualenv installation

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

sai3563 avatar Sep 22 '22 04:09 sai3563

assigned you !

potiuk avatar Sep 22 '22 07:09 potiuk

@potiuk Can you describe what needs to be done? Should I figure out where the code for airflow db init is and add creation date column to the table rendered_task_instance_fields? What about existing users who are upgrading in that case? I want to ensure there is no edge case where the solution to this might break things.

sai3563 avatar Sep 22 '22 09:09 sai3563

The column needs to be added an migration needs to be added to add the column via alembic (and likely set it to current time for all records). You can see how migrations are done in the `airflow/migrations' folder and there is a chapter about adding migrations in https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#metadata-database-updates (and you can look in the past commits to that folder to get a feeling how such changes shoudl look like).

And we will make a review of the PR, so if there will be anything missing, we will help with guidance :)

potiuk avatar Sep 22 '22 09:09 potiuk

Instead of adding a column solely for the purposes of cleaning, we should consider not directly cleaning RTIF and letting the cascading FK on TI do the work for us.

jedcunningham avatar Sep 23 '22 03:09 jedcunningham

Hi @jedcunningham

Great Idea! But if we do that, wouldn't that make it auto delete records in RTIF if records in task_instance are also deleted?

https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html

As per the docs above, currently in parameters of airflow db clean, we have parameter --tables to delete tables & separate options are provided for TI and RTIF. With FK with cascading delete, wouldn't it break this behavior and cause RTIF deletion by just selecting TI in the parameter?

sai3563 avatar Sep 23 '22 06:09 sai3563

There is already that behavior if you only choose to clear TI - RTIF is dependent on the TI existing already.

The removal of execution_date/the transition to the TI FK AND the release of the db clean feature happened in 2.3.0, so it has never worked for RTIF!

If we still want to support cleaning RTIF directly, it should use the date from TI instead. I'd rather we remove RTIF as an option and just emit a warning if folks pass it in --tables (just to be extra careful we don't break peoples cli commands).

jedcunningham avatar Sep 23 '22 16:09 jedcunningham

@jedcunningham

If we still want to support cleaning RTIF directly, it should use the date from TI instead

Yes, this can be done, but from my quick code review of site-packages/airflow/utils/db_cleanup.py, it seems to be made in a way that we just pass table details and timestamp field and it will automatically do the deletion with the same logic for all tables

_cleanup_table(
                clean_before_timestamp=clean_before_timestamp,
                dry_run=dry_run,
                verbose=verbose,
                **table_config.__dict__,
                skip_archive=skip_archive,
                session=session,
            )

If we are going to delete via TI FK, I believe this would require a separate logic to be coded just to delete RTIF.

If we still want to support cleaning RTIF directly, it should use the date from TI instead. I'd rather we remove RTIF as an option and just emit a warning if folks pass it in --tables (just to be extra careful we don't break peoples cli commands).

This is exactly the behavior we are seeing at the moment, which I've described in my initial post, although yes we can change the warning to make it sound like an intended behavior rather than a bug.

Regardless of what path we choose from the above 2, please let me know as I would most likely work on this tomorrow & would need a clear path.

sai3563 avatar Sep 23 '22 17:09 sai3563

Sorry, I'm advocating for removing RTIF as an option altogether and if you run airflow db clean -t rendered_task_instance_fields you get a warning like this instead:

WARNING - Sorry, you cannot clean 'rendered_task_instance_fields' directly. It gets cleaned up along with 'task_instance'

Interestingly, we don't appear to error check tables we pass in anyways, so straight up removing it might be an option too.

cc @dstandish

jedcunningham avatar Sep 23 '22 17:09 jedcunningham

@jedcunningham

Alright then, what specific change would need to be done from my end in the code, if any?

sai3563 avatar Sep 23 '22 17:09 sai3563

I think just removing this line? A quick look makes me think everything else (e.g. help text) is okay as is because it's dynamically generated.

jedcunningham avatar Sep 23 '22 17:09 jedcunningham

@jedcunningham Alright, will do the same tomorrow.

sai3563 avatar Sep 23 '22 17:09 sai3563

Hey @sai3563, any luck?

jedcunningham avatar Oct 09 '22 00:10 jedcunningham