warn-scraper
Ms scraper #373
This is for issue #373, to add a scraper for Mississippi (never spelled that one wrong...).
Works correctly, but around 8 rows have two of the values switched, all for the same reason. Should I fix that or leave it for downstream?
My bad, done! I copied the current mi.py code.
Triggering tests by closing and reopening.
OK, so for the record I've done some terrible things to @Ash1R's draft, and hope to do more soon and get this into production.
- Realized the naming scheme for HTML in the cache was bad, then realized caching the HTML was unnecessary.
- Realized caching of the PDFs was faulty -- it'd cache every page once, so the first layoff of a quarter would be seen as the complete quarter. I'm hoping to shift the older PDFs into an exported historical CSV to make everything run that much faster. That's partially implemented, but we still need clean CSV exports from which to generate that data. We're close there.
- Realized some of the PDF data was coming through dirty -- line breaks in the middle of company names, odd Unicode dashes. Wrote a simple function to clean that up (sketch below).
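
For reference, a minimal sketch of the kind of cleanup that function does; the function name and the exact set of substitutions here are illustrative, not the code that shipped:

```python
import re


def clean_text(value: str) -> str:
    """Normalize text pulled from the PDFs: join mid-name line breaks,
    swap odd Unicode dashes/apostrophes for ASCII, and squeeze whitespace."""
    if value is None:
        return ""
    # Join line breaks that land in the middle of company names
    value = value.replace("\n", " ")
    # Normalize Unicode dashes and curly apostrophes to ASCII equivalents
    value = value.replace("\u2013", "-").replace("\u2014", "-")
    value = value.replace("\u2019", "'").replace("\u2018", "'")
    # Collapse runs of whitespace left behind by the replacements
    return re.sub(r"\s+", " ", value).strip()
```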
To-do:
- [x] Improve logging
- [x] Revamp Excel logic -- skipping some cells is disastrous, as with the Aramark March 2019 line getting merged with the Palmer House April 2019 line. The fix might be something as basic as: if the last cell is empty, drop in a null value, add the row, and we get on with life (see the first sketch after this list). The 8/3/2012 and 11/21/2013 rows will be useful edge cases for validation.
- [ ] Implement historical data download and import
- [x] Implement record duplicate checker
- [x] Evaluate whether it's possible to extract city and county from this thing and, if so, implement it. A sample suggests it's always the last line, typically a city name, sometimes a garbage character, the county in parentheses, sometimes a ZIP code (see the second sketch after this list). Is county good enough?
- [ ] Build out a transformer. Note a bad date with a year of 208.
- [x] Improve text cleanup -- Ruth's Chris Steak House with an odd apostrophe. Same for P.F. Chang's.
- [ ] Evaluate a number of rows from 2013-2015 with fields swapped around, and at least one with an extra field. Historical data can be patched up before export, but getting the scraper to work on them could improve future effectiveness.
- [ ] Consider dropping pre-2016 rows if cleanup seems too unwieldy, then flag as an issue.
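
On the Excel/row logic item above, the "if the last cell is empty, drop in a null value" idea could look roughly like this; `pad_row` and `expected_width` are made-up names for illustration, not the actual implementation:

```python
def pad_row(row: list, expected_width: int) -> list:
    """Pad a short row with None so it still lines up with the header,
    instead of silently merging into the next notice's row."""
    return row + [None] * (expected_width - len(row))


# Example: a row missing its final cell gets a placeholder instead of
# bleeding into the following row.
print(pad_row(["Some Company", "3/1/2019", "120"], 4))
# ['Some Company', '3/1/2019', '120', None]
```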
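And a rough sketch of what pulling city/county out of that last line might look like; the regex and field names are guesses at the pattern described above (city, optional stray character, county in parentheses, optional ZIP), not the code that actually landed:

```python
import re

# Pattern: city name, optional stray character, "(County)", optional ZIP
LOCATION_RE = re.compile(
    r"^(?P<city>[A-Za-z .'-]+?)\W*\((?P<county>[^)]+)\)\s*(?P<zip>\d{5})?\s*$"
)


def split_location(last_line: str):
    """Try to pull city, county and ZIP out of the address's last line."""
    match = LOCATION_RE.match(last_line.strip())
    if not match:
        return None, None, None
    return match.group("city"), match.group("county"), match.group("zip")


print(split_location("Tupelo (Lee County) 38801"))
# ('Tupelo', 'Lee County', '38801')
```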
@Ash1R, I've got a bunch more validation in the scraper. I incorporated the fixes made by @jsvine but then had to go considerably further to patch an even weirder PDF. Still need to set up some of the historical data, but first need to do some validation of the CSV. Looks like it picked up about 30 more rows than you were getting, which is ... weird.
Seeing some data integrity problems with edge cases that bump up against the "every other row has the layoff number" logic. A good example: https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf
Another way to handle this might be to split the rows up into sections (e.g., every section must have a "/" in the first cell of its first row, to show a date). That's likely overkill (sketch below).
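
For what it's worth, that section-splitting idea might look something like this; just a sketch of the approach, assuming each extracted row is a list of cell strings, and not tested against the real tables:

```python
def split_into_sections(rows: list[list[str]]) -> list[list[list[str]]]:
    """Group extracted table rows into sections, where each section starts
    at a row whose first cell looks like a date (i.e., contains a "/")."""
    sections = []
    current = []
    for row in rows:
        first_cell = (row[0] or "") if row else ""
        if "/" in first_cell and current:
            # A new dated row begins a new section
            sections.append(current)
            current = []
        current.append(row)
    if current:
        sections.append(current)
    return sections
```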
The PDF parsing is still failing in some interesting ways. I tried to get the historical data cleaned up but found most of a page missing, e.g., 152801_py2018_q4_warn_apr2019_jun2019.pdf
I tweaked a couple of things in the Python to try to improve logging and readability in the output, but it doesn't affect the substance, only the sort order.
Somewhat patched CSV: ms.csv
Note the pages set to "manual" -- a practice I only started after patching some of the 2013-2015 rows.