openverse
Add variable to disable removing sql source files during ingestion
Fixes
Fixes #3847 by @AetherUnbound
Description
Adds an Airflow Variable, `AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION`, which controls whether the source files used for ingestion are removed or retained once ingestion finishes.
Testing Instructions
Checklist
- [x] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [x] My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- [x] My commit messages follow best practices.
- [x] My code follows the established code style of the repository.
- [ ] I added or updated tests for the changes I made (if applicable).
- [ ] I added or updated documentation (if applicable).
- [x] I tried running the project locally and verified that there are no visible errors.
- [ ] I ran the DAG documentation generator (if applicable).
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
Oh, it looks like we'll also want to add that condition here too: https://github.com/WordPress/openverse/blob/c0105eb1b89aae9d3548f42ef69be52c5e863609/catalog/dags/providers/provider_api_scripts/inaturalist.py#L331
Apologies for missing that! Using the Variable there as well will prevent the dataset from being redownloaded every time.
@AetherUnbound all suggested code changes have been made.
P.S.: I am unable to verify the intended behavior on my end because downloading the inaturalist_workflow source files keeps failing (they seem quite large and my internet connection is not very reliable). I'm hoping there's a way for you to verify and okay this from your end.
@madewithkode thanks for making those changes! I was able to test this locally, and opted to make two other changes:
- Set the default of the parameter to `False`, since we now have the Variable which can act as a default/fallback
- Change the condition for removing the files to `or`, so only one of them has to be specified as true in order to ensure deletion
This has the effect of making it so that either the parameter or the Variable being set will remove the files, but by default locally, they won't be removed. In production, since we have the Variable set to true, they will always be removed.
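For anyone following along, the combined condition can be sketched roughly like this. It is only an illustration, not the code in this PR: the helper names `should_remove_source_files` and `_as_bool` and the `env` argument are hypothetical, and the value is read from a plain mapping instead of through Airflow's `Variable` so the snippet runs without Airflow installed.

```python
import os
from typing import Mapping, Optional

# Name taken from the PR description. Reading it from the environment here is
# an assumption made for illustration; the DAG itself would use an Airflow Variable.
ENV_KEY = "AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION"


def _as_bool(value: Optional[str]) -> bool:
    """Interpret common truthy strings; a missing value counts as False."""
    if value is None:
        return False
    return value.strip().lower() in {"1", "true", "t", "yes"}


def should_remove_source_files(
    param_value: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True if either the DAG parameter or the Variable asks for removal."""
    env = os.environ if env is None else env
    # "or" semantics: only one of the two needs to be true to trigger deletion.
    return bool(param_value) or _as_bool(env.get(ENV_KEY))
```

With neither value set the function returns `False` (files are kept, matching the local default), and setting either the parameter or the Variable to true switches it to `True`.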
I was able to verify this behavior by:
- Run the DAG once (so the file was downloaded) and check that the cleanup tasks were skipped
- Mark the DAG as failed after it is completed (so a follow-up run doesn't skip during the "check previous success" step)
- Run the DAG again and verify that the `load_catalog_of_life_names` task doesn't have to redownload the COL files.
All that to say, this looks good to me and can be taken out of draft when you're ready!
@AetherUnbound All of this makes sense. Proceeding to take the PR out of draft now. Thank you for the extra effort!