covid-data-pipeline
Scan/Trim/Extract Pipeline for State Coronavirus Site
corona19-data-pipeline
Scan/Trim/Extract Pipeline for Coronavirus Site
- The code now expects to be run from the root directory of the repo.
- This also applies when running it from an IDE such as VS Code.
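A script can guard against being launched from the wrong directory. The sketch below is only an illustration: the function names are hypothetical, and it assumes a `.git` entry is a good enough marker for the repo root.

```python
from pathlib import Path


def is_repo_root(directory: Path, marker: str = ".git") -> bool:
    """Return True if `directory` contains the marker entry.

    Assumption: a ".git" entry identifies the repo root.
    """
    return (directory / marker).exists()


def require_repo_root(directory: Path, marker: str = ".git") -> None:
    """Raise a clear error when the marker is missing from `directory`."""
    if not is_repo_root(directory, marker):
        raise RuntimeError(
            f"Run this from the repo root; no {marker!r} entry found in {directory}"
        )
```

Calling `require_repo_root(Path.cwd())` at the top of an entry-point script would fail fast if an IDE starts the process in a subdirectory.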
Scanner
- Reads the list of URLs from a Google Sheet.
- Pulls the raw HTML for each URL.
- Creates a clean version without the markup.
- Pushes the result into a GitHub repo.
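The "clean version without the markup" step can be sketched with the standard library alone. This is not necessarily how the scanner does it: `TextExtractor` and `strip_markup` are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what "clean" means.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_markup(raw_html: str) -> str:
    """Return the text content of `raw_html`, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

Storing the stripped text next to the raw HTML keeps diffs in the repo focused on content changes rather than markup churn.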
Backup To S3
- Pulls an image for each page.
- Pushes it to an S3 bucket.
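A minimal sketch of the upload step follows. The key layout (timestamp folder plus page name), the PNG content type, and the function names are all assumptions for illustration. The S3 client is passed in so the function can be tried without AWS access; with boto3 it would be `boto3.client("s3")`, whose `put_object` call accepts the keyword arguments used here.

```python
from datetime import datetime, timezone


def build_key(page_name: str, when: datetime, ext: str = "png") -> str:
    """Build an object key such as '20200315-120000/az.png' (assumed layout)."""
    return f"{when:%Y%m%d-%H%M%S}/{page_name}.{ext}"


def backup_image(s3_client, bucket: str, page_name: str,
                 image_bytes: bytes, when: datetime = None) -> str:
    """Upload one page image and return the key it was stored under."""
    when = when or datetime.now(timezone.utc)
    key = build_key(page_name, when)
    # put_object is the boto3 S3 client call for uploading an in-memory body.
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=image_bytes,
        ContentType="image/png",
    )
    return key
```

Grouping keys by capture time keeps every run's images together, so an older snapshot can be found by prefix.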
Specialized_Capture
- Fires up a captive browser.
- Takes a screenshot of each URL in a list.
- If the screenshots change, pushes them into git.
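The "only push if changed" step can be approximated by comparing content hashes before writing. This is a sketch under assumptions: `save_if_changed` is a hypothetical helper, and a byte-exact SHA-256 comparison is assumed to be an acceptable definition of "changed" (rendering differences between runs would also count as changes).

```python
import hashlib
from pathlib import Path


def save_if_changed(path: Path, new_bytes: bytes) -> bool:
    """Write `new_bytes` to `path` only if the content differs.

    Returns True when the file was written (new or changed), False otherwise.
    """
    new_digest = hashlib.sha256(new_bytes).hexdigest()
    if path.exists():
        old_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False  # identical screenshot, nothing to commit
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_bytes)
    return True
```

A caller would collect the paths for which this returns True and commit only those files, so unchanged screenshots never produce empty commits.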