refactor harvesting logic repo

Open rshewitt opened this issue 9 months ago • 1 comment

User Story

In order to slim down the harvesting logic repo, datagov wants to remove or replace parts of the harvester codebase.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • [ ] GIVEN the harvesting logic repo
    WHEN the code listed under Background has been removed or replaced
    THEN that code will no longer be present in the repo.

Background

  • we're moving forward with using the HarvestRecord table as our source for comparison instead of ckan.
  • we don't need any S3 logic
    • if we pursue using ckan-solr for our comparison in the future, s3 functionality would prove useful. we can re-add it when we need it.
  • we don't need functionality to read records from ckan
  • the HarvestSource class no longer needs many of its required init args because it will derive those attributes from the harvest source info it requests
  • this ticket won't remove functionality for creating, updating, or deleting packages from ckan
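
A minimal sketch of what comparing against the HarvestRecord table instead of ckan could look like. Everything here is an assumption for illustration: `record_hash`, `compare`, and the identifier-to-hash dictionaries are hypothetical and are not taken from the harvester codebase. The idea is only that "their" records (just pulled from the source) are diffed against "our" records (what the table already holds).

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hypothetical helper: stable hash of a record (keys sorted so order doesn't matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def compare(external: dict, internal: dict) -> dict:
    """Hypothetical comparison of identifier -> hash maps.

    external: hashes of records just pulled from the harvest source ("their" data)
    internal: hashes assumed to be stored in the HarvestRecord table ("our" data)
    """
    external_ids = set(external)
    internal_ids = set(internal)
    return {
        # present at the source but not in our table
        "create": sorted(external_ids - internal_ids),
        # present in our table but gone from the source
        "delete": sorted(internal_ids - external_ids),
        # present in both, but the content hash changed
        "update": sorted(
            i for i in external_ids & internal_ids if external[i] != internal[i]
        ),
    }
```

With a comparison like this, no read from ckan is needed to decide what changed; ckan is only touched afterwards to create, update, or delete packages.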

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • see the items for removal/replacement listed under Background

rshewitt, Apr 29 '24 19:04

here's a list of things i've done so far. haven't pushed anything.

  • harvest.py
    • rewording things to distinguish "our" data from "their" data
    • "harvest source" can mean either a table we manage in this process or an external source of information we harvest. the same ambiguity applies to "harvest record".
    • isolated some functions as utils
    • removed a handful of the required args (they get assigned after getting the harvest source info from the db)
  • moved all database stuff into a separate folder
  • created an example data folder
  • created a schemas folder
  • took out mdtranslator and localstack services from docker compose
  • moved nginx config to root
  • moved database tests to unit test folder
  • consolidated extract tests into one module
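
A rough sketch of the "fewer required args" idea from the harvest.py items above, assuming the class takes only an id and fills in the rest from the stored harvest source info. The field names (`url`, `schema_type`, `source_type`), the `get_harvest_source` lookup, and the in-memory `FAKE_DB` are all hypothetical stand-ins, not the actual harvester code.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the harvest source table in the db.
FAKE_DB = {
    "source-1": {
        "url": "https://example.com/data.json",
        "schema_type": "example-schema",
        "source_type": "document",
    }
}


def get_harvest_source(source_id: str) -> dict:
    """Hypothetical lookup; the real code would query the database."""
    return dict(FAKE_DB[source_id])


@dataclass
class HarvestSource:
    # The only required init arg in this sketch.
    source_id: str
    # Not passed by the caller; assigned from the db info in __post_init__.
    url: str = field(init=False)
    schema_type: str = field(init=False)
    source_type: str = field(init=False)

    def __post_init__(self) -> None:
        info = get_harvest_source(self.source_id)
        self.url = info["url"]
        self.schema_type = info["schema_type"]
        self.source_type = info["source_type"]
```

The caller then only needs something like `HarvestSource("source-1")`, and the remaining attributes come from whatever the db returns.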

rshewitt, May 01 '24 14:05

rewriting tests to work with the refactoring. completed the following unit test categories:

  • cf
  • compare
  • database
  • exception
  • extract

remaining:

  • load
  • utils
  • validate
  • all integration tests

rshewitt, May 03 '24 15:05