data.gov
refactor harvesting logic repo
User Story
In order to slim down the harvesting logic repo, datagov wants to remove/replace some of the harvester codebase
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
- [ ] GIVEN the harvesting logic repo
WHEN the code has been refactored
THEN the code identified for removal will no longer be present in the repo.
Background
- we're moving forward with using the `HarvestRecord` table as our source for comparison instead of ckan.
- we don't need any S3 logic
  - if we pursue using ckan-solr for our comparison in the future, having s3 functionality would prove useful. we can re-add it when we need it.
- we don't need functionality to read records from ckan
- the `HarvestSource` class no longer needs many of its required init args because it will derive those attributes from the harvest source info it requests
- this ticket won't remove functionality for creating, updating, or deleting packages from ckan
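rough sketch of what comparing against the `HarvestRecord` table (instead of ckan) could look like. this is an assumption for illustration, not the repo's actual code: `record_hash` and `compare` are hypothetical names, and it assumes we store one content hash per record identifier.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable content hash for a record; key order does not affect the result."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def compare(external: dict, internal: dict) -> dict:
    """Compare two {identifier: hash} maps.

    external: hashes of records fetched from the harvest source ("their" data)
    internal: hashes already stored in our HarvestRecord table ("our" data)
    (both shapes are assumptions made for this sketch)
    """
    ext_ids, int_ids = set(external), set(internal)
    return {
        # present at the source but not in our table yet
        "create": sorted(ext_ids - int_ids),
        # present in our table but gone from the source
        "delete": sorted(int_ids - ext_ids),
        # present in both, but the content hash changed
        "update": sorted(i for i in ext_ids & int_ids if external[i] != internal[i]),
    }
```

the result only tells us which identifiers to create, update, or delete. the actual ckan writes stay where they are today.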
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
- items for removal/replacement are listed in Background
here's a list of things i've done so far. haven't pushed anything.
- harvest.py
- rewording things to describe "our" data vs "their" data
- harvest source can describe a table we manage in this process vs a source of information we harvest external to us. same thing with harvest record.
- isolated some functions as utils
- removed a handful of the required args (they get assigned after getting the harvest source info from the db)
- moved all database stuff into a separate folder
- created an example data folder
- created a schemas folder
- took out mdtranslator and localstack services from docker compose
- moved nginx config to root
- moved database tests to unit test folder
- consolidated extract tests into one module
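to illustrate the "removed a handful of the required args" item above, here's a minimal sketch of the idea. it's not the real class: the attribute names (`url`, `source_type`) and the `get_source` lookup are placeholders i made up, and the real code would query the db instead of taking a callable.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HarvestSource:
    """Only the id is required; the rest is derived from the stored source info."""

    id: str
    # hypothetical accessor standing in for the db lookup
    get_source: Callable[[str], dict]
    # derived attributes (placeholder names), not accepted by __init__
    url: str = field(init=False)
    source_type: str = field(init=False)

    def __post_init__(self) -> None:
        # fetch the harvest source info once, then assign attributes from it
        info = self.get_source(self.id)
        self.url = info["url"]
        self.source_type = info["source_type"]


# usage with an in-memory stand-in for the harvest source table
SOURCES = {
    "abc": {"url": "https://example.com/data.json", "source_type": "dcatus"},
}
source = HarvestSource("abc", SOURCES.__getitem__)
```

callers only pass the source id (plus however we end up reaching the db), so the long list of required init args goes away.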
rewriting tests to work with the refactoring. completed the following unit test categories:
- cf
- compare
- database
- exception
- extract
remaining:
- load
- utils
- validate
- all integration tests