
Harvest2.0 DB cleaning

Open jbrown-xentity opened this issue 8 months ago • 1 comments

User Story

In order to keep the harvesting2.0 DB light and prevent it from growing indefinitely, data.gov admins want a script to clean up unnecessary data from the DB without losing information that is currently in use.

Acceptance Criteria

  • [ ] GIVEN the harvesting DB is unnecessarily large (extra harvest records) WHEN the DB cleaning script is run
    THEN the desired data is removed

Background

Should not be necessary for the Harvesting MVP, but will be required at some point in the future.

Security Considerations (required)

None.

Sketch

This is the expected process for production. It may or may not need to be made configurable for testing purposes and future flexibility:

  • Never remove harvest_source or harvest_job records
  • Remove harvest_object and harvest_object_error records, where the records satisfy the following:
    • Greater than 90 days old
    • Are no longer being used by catalog (i.e., have been overwritten by another harvest_object)

The "no longer being used by catalog" condition is a hard requirement and will always be in effect; the 90-day threshold should be configurable. Ideally this will be a task that can be run on a regular basis, such as weekly or monthly.
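
The rules above could be sketched roughly as follows. This is only an illustration, not the actual Harvest 2.0 implementation: it uses an in-process SQLite connection, and the column names (`identifier`, `harvest_source_id`, `date_created`, `harvest_object_id`) and the function name `clean_harvest_db` are hypothetical. It also assumes that "overwritten" means a newer harvest_object exists for the same identifier and source, and that a harvest_object_error is removed together with the harvest_object it belongs to.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: the real Harvest 2.0 tables may use different columns.
SCHEMA = """
CREATE TABLE harvest_object (
    id TEXT PRIMARY KEY,
    identifier TEXT NOT NULL,         -- dataset identifier within a source
    harvest_source_id TEXT NOT NULL,
    date_created TEXT NOT NULL        -- ISO 8601 UTC timestamp
);
CREATE TABLE harvest_object_error (
    id TEXT PRIMARY KEY,
    harvest_object_id TEXT NOT NULL REFERENCES harvest_object(id)
);
"""

# Objects older than the cutoff that a newer object for the same dataset has
# overwritten. The EXISTS clause is the hard requirement; only the cutoff
# is configurable.
STALE_IDS = """
SELECT old.id FROM harvest_object AS old
WHERE old.date_created < :cutoff
  AND EXISTS (
      SELECT 1 FROM harvest_object AS newer
      WHERE newer.identifier = old.identifier
        AND newer.harvest_source_id = old.harvest_source_id
        AND newer.date_created > old.date_created
  )
"""


def clean_harvest_db(conn, max_age_days=90, now=None):
    """Delete stale harvest objects and their errors; return how many objects were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat(timespec="seconds")
    with conn:  # one transaction: commit on success, roll back on error
        stale = [row[0] for row in conn.execute(STALE_IDS, {"cutoff": cutoff})]
        for object_id in stale:
            conn.execute(
                "DELETE FROM harvest_object_error WHERE harvest_object_id = ?",
                (object_id,),
            )
            conn.execute("DELETE FROM harvest_object WHERE id = ?", (object_id,))
    return len(stale)
```

The sketch never touches harvest_source or harvest_job, and an object that is older than the cutoff but still the latest for its dataset is kept. A scheduled weekly or monthly job could call `clean_harvest_db(conn)` with the default 90 days, while tests could pass a smaller `max_age_days`.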

jbrown-xentity, Jun 03 '24 18:06