
Add scrapers style requirements to readme / templates

Open josh-chamberlain opened this issue 3 years ago • 0 comments

The task:

  • [ ] Represent these requirements in the scrapers readme or template as appropriate
  • [ ] Represent them by creating an example scraper that meets the criteria

Good scrapers:

  • Scraper must be able to pick up where it left off, i.e., it does not do a complete grab each time, only the differences since the last run.
  • Scraper saves files to our Hadoop.
  • Scraper saves metadata to our database (Dolt or PostgreSQL).
  • Scraper produces a SHA256 and an MD5 hash for every file it generates and records them in the database. A separate script can be used for this. The workflow would be something like scraper > extractor > saver.
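
A minimal sketch of two of these requirements, hashing each generated file and skipping records already handled by an earlier run. Everything named here (`file_hashes`, `load_seen`, `save_seen`, `only_new`, and the JSON state file) is hypothetical and only illustrates the idea; saving to Hadoop and to the database is left out.

```python
import hashlib
import json
from pathlib import Path


def file_hashes(path, chunk_size=65536):
    """Return the SHA256 and MD5 hex digests of a file, reading it in chunks."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}


def load_seen(state_path):
    """Load the record IDs handled by previous runs (empty set on first run)."""
    state_file = Path(state_path)
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def save_seen(state_path, seen):
    """Persist the handled record IDs so the next run can resume."""
    Path(state_path).write_text(json.dumps(sorted(seen)))


def only_new(record_ids, seen):
    """Keep only the IDs that were not handled before, preserving order."""
    return [record_id for record_id in record_ids if record_id not in seen]
```

A scraper could call `only_new` on the IDs it finds, fetch just those, pass each saved file to `file_hashes`, and then update the state with `save_seen`. The hashes and other metadata would then go to the database in the saver step.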

Questions:

  • Where would they save the keys? Keys or developer API tokens, similar to those that GitHub or other cloud services use, can be stored in the config file of the individual scraper.

  • Does the script have to generate its own key? No. We generate them on the server and assign them to scrapers.

  • Do all the scrapers just use a common key that is located on the scraping server? No. Each scraper will have its own.
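
As an illustration of the per-scraper config file mentioned above, here is a hedged sketch that reads a key from an INI file with Python's standard `configparser`. The section name `auth`, the option name `api_key`, and the function `load_api_key` are assumptions made for this example, not an agreed format.

```python
import configparser


def load_api_key(config_path):
    """Read this scraper's server-assigned key from its own config file.

    Assumes a hypothetical layout:
        [auth]
        api_key = <key assigned by the server>
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse.
    if not parser.read(config_path):
        raise FileNotFoundError(f"config file not found: {config_path}")
    try:
        return parser["auth"]["api_key"]
    except KeyError:
        raise KeyError("config file is missing [auth] api_key") from None
```

Because each scraper has its own key, the config file should stay out of version control, for example by listing it in `.gitignore`.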

josh-chamberlain, Jun 07 '21 13:06