Extraction Intake

Open CaptainStabs opened this issue 3 years ago • 0 comments

A process which, when run, submits a scraper’s Extraction and metadata to our database.

For now, we're going to use CKAN instead of making our own API from scratch.

Key user story

As a data scraping volunteer, I should be able to run a Scraper from the Scrapers repo and submit the Extraction to PDAP.

Details

We need a place to put Extractions and their Metadata. Once an Extraction has been dropped off there, we should link to its path in the data_intake database.

The simplest, most modern solution is probably an API endpoint.
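Since the plan for now is CKAN, the endpoint could simply be CKAN's own action API. Below is a minimal sketch of what a submission request might look like, assuming a CKAN instance whose URL, API token, and organization are placeholders, and assuming the Extraction's path is stored as a dataset "extra" named `extraction_path` (a hypothetical field, not something decided in this issue). It only builds the request; nothing is sent.

```python
import json
import urllib.request


def build_package_create_request(ckan_url, api_token, name, owner_org, extraction_path):
    """Build (but do not send) a CKAN package_create request for one Extraction."""
    payload = {
        "name": name,            # CKAN dataset names are lowercase slugs
        "owner_org": owner_org,  # placeholder organization id or name
        # "extraction_path" is an illustrative extra, not an agreed field name
        "extras": [{"key": "extraction_path", "value": extraction_path}],
    }
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )
```

A scraper could pass the returned request to `urllib.request.urlopen` once a real instance and token exist; keeping the build step separate makes it easy to inspect what would be submitted.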

What's in an Extraction?

The goal: a bright line between the source material and the scraped result, both captured at the same time, with the source code included. We can publish these on the website as case studies without fear of legal trouble.

  • an extraction of "raw files", i.e. no OCR or translation
  • a metadata.json file
  • the scraper.py code itself (nice to have)
    • this could point at GitHub
    • we don't technically need this as long as we have timestamped version history in GitHub, though that is harder to untangle and troubleshoot, and less standalone
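To make the list above concrete, here is a rough sketch of how a scraper might write `metadata.json` next to its raw files. The field names (`source_url`, `scraper`, `extracted_at`, `files`) and the use of SHA-256 checksums are assumptions for illustration only; the issue does not define a metadata schema. The raw files are only read, never modified.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_metadata(extraction_dir, source_url, scraper_path):
    """Describe every raw file in extraction_dir without touching its contents."""
    root = Path(extraction_dir)
    files = []
    for path in sorted(root.rglob("*")):
        # Skip directories and any metadata.json left over from a previous run
        if path.is_file() and path.name != "metadata.json":
            files.append({
                "path": path.relative_to(root).as_posix(),
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                "bytes": path.stat().st_size,
            })
    return {
        "source_url": source_url,  # where the raw files came from
        "scraper": scraper_path,   # local path or GitHub link to scraper.py
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def write_metadata(extraction_dir, source_url, scraper_path):
    """Write metadata.json into the Extraction directory and return its path."""
    metadata = build_metadata(extraction_dir, source_url, scraper_path)
    target = Path(extraction_dir) / "metadata.json"
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target
```

The checksums give each raw file a fingerprint, so a published Extraction can later be checked against what the scraper originally collected.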

Visual aid

https://pdap.invisionapp.com/freehand/Data-intake-flow-Q01qjpCvN

To do

  • [x] #135
  • [x] #137
  • [x] #134
  • [ ] https://github.com/police-data-accessibility-project/planning/issues/161
  • [ ] https://github.com/police-data-accessibility-project/pdap-scrapers/issues/154 (closed by #181)
  • [ ] https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/issues/180 (closed by #181)
  • [ ] https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/issues/173
  • [ ] https://github.com/police-data-accessibility-project/pdap-scrapers/issues/153

CaptainStabs • Apr 22 '21 12:04