Discontinue old CA scraping

Open stucka opened this issue 2 years ago • 1 comment

The CA scraper is still parsing PDFs from 2015 and, not surprisingly, is the slowest-running scraper of the bunch.

Aug 10 '23 19:08 stucka

I wonder whether:

The scraper could be sped up and cleaned up.
The data from older years could be archived so it is retained without our having to continually re-scrape it. There's precedent for hosting a spreadsheet file somewhere static, perhaps on BigLocalNews, which the scraper just pulls and integrates. That data doesn't change much, but archival data is still good to have.

The oldest states' data goes back to 1989, when the WARN Act first took effect. I started thinking of completeness in terms of not just states but also years (I was shooting for seven years of coverage, based on what seemed achievable for historical comparison) and people (because all 50 states isn't possible, you'll want to be able to say the data covers 9X% of the U.S. population), as well as percentage of job loss overall.
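
To make the archive idea concrete, here is a rough sketch of what "pull and integrate" might look like. It is not how the CA scraper works today. The function names, the column names (`notice_date`, `company`, `employees`) and the sample rows are all made up for illustration, and the archive is passed in as CSV text rather than fetched from a real static URL.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of dict rows (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_archive(archived_rows, fresh_rows, key_fields):
    """Combine archived rows with freshly scraped rows.

    Rows are keyed on key_fields. When a fresh row has the same key as
    an archived row, the fresh row wins, so corrections in the live
    source are picked up without re-parsing the old PDFs.
    """
    merged = {}
    for row in archived_rows:
        merged[tuple(row[f] for f in key_fields)] = row
    for row in fresh_rows:
        merged[tuple(row[f] for f in key_fields)] = row
    return list(merged.values())


# Made-up sample data standing in for a static archive file and a
# fresh scrape of recent years only.
ARCHIVE_CSV = (
    "notice_date,company,employees\n"
    "2015-03-01,Acme,40\n"
    "2016-07-15,Globex,120\n"
)
FRESH_CSV = (
    "notice_date,company,employees\n"
    "2016-07-15,Globex,125\n"
    "2023-08-01,Initech,60\n"
)

merged = merge_archive(
    parse_csv(ARCHIVE_CSV),
    parse_csv(FRESH_CSV),
    key_fields=["notice_date", "company"],
)
for row in merged:
    print(row["notice_date"], row["company"], row["employees"])
```

The point of keying rows like this is that only recent years need live scraping, while the older years ride along from the archive. Which columns make a reliable key for CA notices is an open question.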

Sep 13 '23 05:09 chriszs