data.gov
Harvest2.0 DB cleaning
User Story
In order to keep the harvesting2.0 DB light and prevent it from growing indefinitely, data.gov admins want a script to clean up unnecessary data from the DB without losing current information in use.
Acceptance Criteria
- [ ] GIVEN the harvesting DB is unnecessarily large (extra harvest records)
WHEN the DB cleaning script is run
THEN the desired data is removed
Background
Should not be necessary for the Harvesting MVP, but will be required at some point in the future.
Security Considerations (required)
None.
Sketch
This is the expected process for production. It may or may not need to be made configurable for testing purposes and future flexibility:
- Never remove harvest_source or harvest_job records
- Remove harvest_object and harvest_object_error records, where the records satisfy the following:
  - Greater than 90 days old
  - Are no longer being used by catalog (i.e., have been overwritten by another harvest_object)
The "no longer being used by catalog" condition is a hard requirement and will always be in effect; the 90-day threshold should be configurable. Ideally this will be a task that can be run on a regular basis, such as weekly or monthly.
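
The rules above could be sketched roughly as follows. This is only an illustration, not the actual implementation: it uses Python's built-in `sqlite3` so it can run standalone, and the column names (`harvest_source_id`, `identifier`, `date_created`, `harvest_object_id`) and the function name `clean_harvest_db` are assumptions rather than the real Harvest2.0 schema. It also assumes that "overwritten" means a newer harvest_object exists for the same source and identifier, that timestamps are stored as ISO 8601 UTC strings, and that harvest_object_error records are removed together with the harvest_object they belong to.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def clean_harvest_db(conn, retention_days=90, now=None):
    """Delete superseded harvest_object rows older than retention_days.

    Hypothetical schema (not confirmed by the issue):
      harvest_object(id, harvest_source_id, identifier, date_created)
      harvest_object_error(id, harvest_object_id, message)
    harvest_source and harvest_job are never touched.
    Returns the number of harvest_object rows removed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()

    # A row qualifies only if BOTH conditions hold:
    #   1. it is older than the (configurable) cutoff, and
    #   2. a newer object exists for the same source + identifier,
    #      i.e. the catalog is no longer using it (hard requirement).
    stale_query = """
        SELECT o.id
        FROM harvest_object AS o
        WHERE o.date_created < :cutoff
          AND EXISTS (
              SELECT 1
              FROM harvest_object AS newer
              WHERE newer.harvest_source_id = o.harvest_source_id
                AND newer.identifier = o.identifier
                AND newer.date_created > o.date_created
          )
    """

    with conn:  # one transaction: commit on success, roll back on error
        stale_ids = [(row[0],) for row in conn.execute(stale_query, {"cutoff": cutoff})]
        # Remove dependent error rows first, then the objects themselves.
        conn.executemany(
            "DELETE FROM harvest_object_error WHERE harvest_object_id = ?", stale_ids
        )
        conn.executemany("DELETE FROM harvest_object WHERE id = ?", stale_ids)

    return len(stale_ids)
```

Keeping `retention_days` as a parameter leaves the 90-day threshold configurable, while the "superseded" check is hard-coded into the query so it cannot be switched off. A function like this could then be called from whatever scheduler runs the weekly or monthly task.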