Add variable to disable removing SQL source files for ingestion workflows
Description
The iNaturalist DAG uses the ingestion workflow's sql_rm_source_data_after_ingesting parameter to determine whether it should remove or retain the source files used for ingestion:
https://github.com/WordPress/openverse/blob/2cffcb9f8da6961e84a00854a3cd472fd0f9dad8/catalog/dags/providers/provider_dag_factory.py#L422-L430
While this is useful for specific runs, the iNaturalist DAG is scheduled, which means that the default run that gets kicked off locally when the DAG is enabled will remove the source files. Since these can be quite large, it's tedious and time-consuming to have to trigger each run manually with the sql_rm_source_data_after_ingesting box unchecked.
We should also add an Airflow Variable that determines whether the files should be removed. The variable could be named SQL_RM_SOURCE_DATA_AFTER_INGESTION, meaning the entry added to our env.template file would be AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION. It should default to True in the code but be set to False in the env.template file, so that local runs retain source files by default.
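To illustrate the intended defaults, here is a minimal sketch that mimics the lookup with only the standard library. Airflow exposes environment variables prefixed with AIRFLOW_VAR_ as Variables; the helper name `get_rm_source_data_setting` and the JSON-style parsing are assumptions for illustration, not the code in the catalog.

```python
import json
import os

# Variable name proposed in this issue.
VARIABLE_NAME = "SQL_RM_SOURCE_DATA_AFTER_INGESTION"


def get_rm_source_data_setting(environ=None) -> bool:
    """Hypothetical helper: resolve the setting from the environment.

    Falls back to True when the variable is not defined, which matches
    the "True by default in the code" behaviour described above.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(f"AIRFLOW_VAR_{VARIABLE_NAME}")
    if raw is None:
        return True
    # Assumes a JSON-style boolean such as "true"/"false" (any casing).
    return bool(json.loads(raw.lower()))
```

With an env.template entry along the lines of `AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION=false` (the exact value format is an assumption), the helper returns False locally, while an environment without the entry still gets True.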
We will also need to update the short-circuit task for skipping this to include checking this variable as well:
https://github.com/WordPress/openverse/blob/dca01105cf8ac6e14edb0dffacaf8ea8d2d01632/catalog/dags/providers/provider_api_scripts/inaturalist.py#L347-L355
The check should be such that if either the param or the Airflow Variable is set to False, the files are retained. We should be able to use the {{ var.json.<variable_name> }} syntax to template this into the op_args, similar to the param.
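The combined check could look roughly like the sketch below. The function names are hypothetical, and the string handling assumes the templated op_args may arrive as rendered strings (for example "False") rather than native booleans; the existing short-circuit callable may be structured differently.

```python
def _as_bool(value) -> bool:
    """Coerce a templated value to bool; rendered templates are often strings."""
    if isinstance(value, str):
        return value.strip().lower() in {"true", "1"}
    return bool(value)


def should_remove_source_data(param_value, variable_value) -> bool:
    """Hypothetical callable for the short-circuit task.

    Returns True (continue on to removal) only when BOTH the DAG param and
    the Airflow Variable allow it; if either is False, files are retained.
    """
    return _as_bool(param_value) and _as_bool(variable_value)
```

A short-circuit task that uses a callable like this skips its downstream removal step whenever the return value is falsy, so unchecking the param or setting the Variable to False is each enough on its own to keep the source files.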
Additional context
See #3846 for the impetus
Hi,
I would like to work on this issue!
Fantastic, welcome @Pqformeln! I'll assign the issue to you 😄 Please check out our welcome and quickstart documentation pages, and if you have any questions about this issue feel free to leave them here!
I'd love to take on this @AetherUnbound @obulat
I'll assign you @madewithkode! Let us know if you have any questions about it 😄
Thanks @AetherUnbound. I think I have basically done what the task requires (not exactly sure how to test it, though). Can I go ahead and open a draft MR so someone can take a look and see if my understanding of the requirements is correct?
Absolutely!
Hi @AetherUnbound, I have now made a draft PR here so we can confirm if I'm in the right direction and also provide ways to test.