scrapers
Extraction Intake
A process which, when run, submits a scraper’s Extraction and metadata to our database.
For now, we're going to use CKAN instead of making our own API from scratch.
Key user story
As a data scraping volunteer, I should be able to run a Scraper from the Scrapers repo and submit the Extraction to PDAP.
Details
We need a place to put Extractions and their metadata. Once an Extraction is dropped, we should link to its path in the data_intake database.
The simplest, most modern solution is probably an API endpoint.
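
Since we're leaning on CKAN rather than building our own API, the submission step could be a thin wrapper around CKAN's Action API. Below is a minimal sketch assuming a stock CKAN instance; the host URL, API key, and dataset fields are placeholders, and a real instance may require extra fields such as an owner_org.

```python
"""Sketch of submitting a finished Extraction to a CKAN instance.

The URL, API key, and dataset fields below are placeholders, not real values.
"""
from pathlib import Path

import requests

CKAN_URL = "https://ckan.example.org"   # placeholder -- the real data_intake CKAN host
API_KEY = "volunteer-api-key"           # placeholder -- issued per volunteer


def submit_extraction(extraction_dir: Path, dataset_name: str) -> str:
    """Create one CKAN dataset per Extraction and upload every file in it.

    Uses CKAN's stock Action API (package_create / resource_create).
    Returns the new dataset id so data_intake can store the link.
    """
    headers = {"Authorization": API_KEY}

    # 1. Create the dataset (CKAN "package") that will hold this Extraction.
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        headers=headers,
        json={"name": dataset_name, "notes": "Raw files from a PDAP scraper run"},
    )
    resp.raise_for_status()
    package_id = resp.json()["result"]["id"]

    # 2. Upload each raw file and metadata.json as a resource on that dataset.
    for path in sorted(extraction_dir.iterdir()):
        with open(path, "rb") as upload:
            requests.post(
                f"{CKAN_URL}/api/3/action/resource_create",
                headers=headers,
                data={"package_id": package_id, "name": path.name},
                files={"upload": upload},
            ).raise_for_status()

    return package_id
```

Whatever dataset id or URL comes back is what we'd record in the data_intake database as the Extraction's path.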
What's in an Extraction?
The goal is a bright line between the source material and the scraped result, captured at the same point in time, with the source code included. We can publish these on the website as case studies without fear of legal trouble.
- an extraction of "raw files", i.e. no OCR or translation
- a metadata.json file (see the sketch after this list)
- the scraper.py code itself (nice to have). This could point at GitHub; we don't technically need it as long as we have time-stamped version history in GitHub, though that is tougher to untangle and troubleshoot and not as standalone.
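
The exact schema for metadata.json isn't pinned down here, but a scraper could emit something like the sketch below at the end of a run. Every field name is hypothetical and would need to be agreed on with the data_intake side.

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata.json contents -- field names are illustrative only.
metadata = {
    "agency": "Example Police Department",          # who the records belong to
    "source_url": "https://example.gov/records",    # where the raw files came from
    "scraper": "scraper.py",                        # the code that produced this Extraction
    "scraper_commit": "abc1234",                    # git SHA, so the run is reproducible
    "scraped_at": datetime.now(timezone.utc).isoformat(),
    "files": ["report_2021-04.pdf", "report_2021-05.pdf"],
    "notes": "Raw files only; no OCR or translation applied.",
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Pairing the source URL, the scraper commit, and a UTC timestamp is what gives us the bright line between source material and scraped result described above.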
Visual aid
https://pdap.invisionapp.com/freehand/Data-intake-flow-Q01qjpCvN
To do
- [x] #135
- [x] #137
- [x] #134
- [ ] https://github.com/police-data-accessibility-project/planning/issues/161
- [ ] https://github.com/police-data-accessibility-project/pdap-scrapers/issues/154 (closed by #181)
- [ ] https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/issues/180 (closed by #181)
- [ ] https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/issues/173
- [ ] https://github.com/police-data-accessibility-project/pdap-scrapers/issues/153