airflow db clean is unable to delete the table rendered_task_instance_fields
Apache Airflow version
Other Airflow 2 version
What happened
Hi All,
When I run the below command in Airflow 2.3.4:
airflow db clean --clean-before-timestamp '2022-09-18T00:00:00+05:30' --yes
I receive an error within a warning which says
[2022-09-20 10:33:30,971] {db_cleanup.py:302} WARNING - Encountered error when attempting to clean table 'rendered_task_instance_fields'.
All other tables like log, dag, xcom get deleted properly. On my analysis, rendered_task_instance_fields was the 5th largest table by rows in the DB, so the impact of it's data size is significant.
On analyzing the table itself on PostGres 13 DB, I found that the table rendered_task_instance_fields has no timestamp column that records when the entry was inserted.
https://imgur.com/a/Qys2uwD
Thus, there would be no way the code can filter out older records and delete them.
What you think should happen instead
A timestamp field needs to be added to the table rendered_task_instance_fields basis of which older records can be deleted.
How to reproduce
Run the below command in airflow v2.3.4 and check the output.
airflow db clean --clean-before-timestamp '2022-09-18T00:00:00+05:30' --yes
Operating System
Ubuntu 20.04 LTS
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==5.1.0 apache-airflow-providers-celery==3.0.0 apache-airflow-providers-common-sql==1.2.0 apache-airflow-providers-ftp==3.1.0 apache-airflow-providers-google==8.3.0 apache-airflow-providers-http==4.0.0 apache-airflow-providers-imap==3.0.0 apache-airflow-providers-mongo==3.0.0 apache-airflow-providers-mysql==3.2.0 apache-airflow-providers-slack==5.1.0 apache-airflow-providers-sqlite==3.2.1
Deployment
Virtualenv installation
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
assigned you !
@potiuk Can you describe what needs to be done? Should I figure out where the code for airflow db init is and add creation date column to the table rendered_task_instance_fields? What about existing users who are upgrading in that case? I want to ensure there is no edge case where the solution to this might break things.
The column needs to be added an migration needs to be added to add the column via alembic (and likely set it to current time for all records). You can see how migrations are done in the `airflow/migrations' folder and there is a chapter about adding migrations in https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#metadata-database-updates (and you can look in the past commits to that folder to get a feeling how such changes shoudl look like).
And we will make a review of the PR, so if there will be anything missing, we will help with guidance :)
Instead of adding a column solely for the purposes of cleaning, we should consider not directly cleaning RTIF and letting the cascading FK on TI do the work for us.
Hi @jedcunningham
Great Idea! But if we do that, wouldn't that make it auto delete records in RTIF if records in task_instance are also deleted?
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html
As per the docs above, currently in parameters of airflow db clean, we have parameter --tables to delete tables & separate options are provided for TI and RTIF. With FK with cascading delete, wouldn't it break this behavior and cause RTIF deletion by just selecting TI in the parameter?
There is already that behavior if you only choose to clear TI - RTIF is dependent on the TI existing already.
The removal of execution_date/the transition to the TI FK AND the release of the db clean feature happened in 2.3.0, so it has never worked for RTIF!
If we still want to support cleaning RTIF directly, it should use the date from TI instead. I'd rather we remove RTIF as an option and just emit a warning if folks pass it in --tables (just to be extra careful we don't break peoples cli commands).
@jedcunningham
If we still want to support cleaning RTIF directly, it should use the date from TI instead
Yes, this can be done, but from my quick code review of site-packages/airflow/utils/db_cleanup.py, it seems to be made in a way that we just pass table details and timestamp field and it will automatically do the deletion with the same logic for all tables
_cleanup_table(
clean_before_timestamp=clean_before_timestamp,
dry_run=dry_run,
verbose=verbose,
**table_config.__dict__,
skip_archive=skip_archive,
session=session,
)
If we are going to delete via TI FK, I believe this would require a separate logic to be coded just to delete RTIF.
If we still want to support cleaning RTIF directly, it should use the date from TI instead. I'd rather we remove RTIF as an option and just emit a warning if folks pass it in --tables (just to be extra careful we don't break peoples cli commands).
This is exactly the behavior we are seeing at the moment, which I've described in my initial post, although yes we can change the warning to make it sound like an intended behavior rather than a bug.
Regardless of what path we choose from the above 2, please let me know as I would most likely work on this tomorrow & would need a clear path.
Sorry, I'm advocating for removing RTIF as an option altogether and if you run airflow db clean -t rendered_task_instance_fields you get a warning like this instead:
WARNING - Sorry, you cannot clean 'rendered_task_instance_fields' directly. It gets cleaned up along with 'task_instance'
Interestingly, we don't appear to error check tables we pass in anyways, so straight up removing it might be an option too.
cc @dstandish
@jedcunningham
Alright then, what specific change would need to be done from my end in the code, if any?
I think just removing this line? A quick look makes me think everything else (e.g. help text) is okay as is because it's dynamically generated.
@jedcunningham Alright, will do the same tomorrow.
Hey @sai3563, any luck?