scrapers
Add scrapers style requirements to readme / templates
The task:
- [ ] Represent these requirements in the scrapers readme or template as appropriate
- [ ] Demonstrate them by creating an example scraper that meets the criteria
Good scrapers:
- Scraper must be able to pick up where it left off, i.e., fetch only the differences since the last run rather than doing a complete grab each time.
- Scraper saves its files to our Hadoop.
- Scraper saves metadata to our database (Dolt or PostgreSQL).
- Scraper produces a SHA256 and an MD5 hash for every file it generates and records them in the database.
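The hashing requirement in the last bullet could be sketched as a single-pass helper; the function name is illustrative, and recording the result in the database is left to the caller:

```python
import hashlib

def file_digests(path, chunk_size=1 << 20):
    """Compute SHA256 and MD5 of a file in a single pass over its bytes."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large scraped files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return sha256.hexdigest(), md5.hexdigest()
```

Reading the file once and feeding both hash objects avoids a second pass over what may be a large grab.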
A separate script can be used for this. The workflow would be something like:

`scraper > extractor > saver`
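A minimal sketch of that scraper > extractor > saver pipeline, with hypothetical record fields, an in-memory stand-in for Hadoop and the database, and a timestamp checkpoint for the incremental-grab requirement:

```python
def scrape(since, source):
    # Incremental grab: keep only records newer than the last run's checkpoint.
    # ISO-8601 timestamps compare correctly as strings.
    return [r for r in source if r["fetched_at"] > since]

def extract(records):
    # Normalize raw records into (content, metadata) pairs.
    return [(r["body"], {"url": r["url"], "fetched_at": r["fetched_at"]})
            for r in records]

def save(items, store):
    # Stand-in for writing files to Hadoop and metadata rows to the database.
    for content, meta in items:
        store.append({**meta, "size": len(content)})

# Illustrative data: one record before the checkpoint, one after.
source = [
    {"url": "https://example.org/a", "body": "old", "fetched_at": "2024-01-01T00:00:00Z"},
    {"url": "https://example.org/b", "body": "new", "fetched_at": "2024-06-01T00:00:00Z"},
]
store = []
save(extract(scrape("2024-03-01T00:00:00Z", source)), store)
# Only the record newer than the checkpoint reaches the store.
```

Keeping the three stages as separate functions matches the idea above of hashing (or any other step) living in its own script.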
Questions:
- Where would they save the keys? Keys or developer API tokens, similar to those GitHub and other cloud services use, can be stored in the config file of the individual scraper.
- Does the script have to generate its own key? No; we generate keys on the server and assign them to scrapers.
- Do all the scrapers just use a common key located on the scraping server? No; each scraper will have its own.
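Reading a server-assigned token from a per-scraper config file might look like the following; the section and key names are assumptions, and a real scraper would read from a file path rather than an inline string:

```python
import configparser

# Contents of a hypothetical per-scraper config file (e.g. scraper.ini).
EXAMPLE_CONFIG = """
[auth]
api_token = example-token-assigned-by-server
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)  # a real scraper would use config.read("scraper.ini")
token = config["auth"]["api_token"]
```

Because each scraper has its own file, rotating or revoking one scraper's key never touches the others.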