
WIP PA scraper

Open chriszs opened this issue 1 year ago • 5 comments

Draft PR to add a scraper for Pennsylvania.

Incorporates and builds upon @Ash1R's fixes and @stucka's edits from #517 by cherry-picking their commits.

Steps to test

python -m warn.cli PA

Closes #374

chriszs • Mar 10 '24 13:03

Still some data quality issues to resolve:

[Two screenshots attached: Screenshot 2024-03-10 at 9.30.29 PM, Screenshot 2024-03-10 at 9.30.47 PM]

But we're inching closer.

chriszs • Mar 10 '24 13:03

The HTML parser trashes the p tags, and I'm wondering whether that might be contributing to some of the problems here. Example page: https://www.dli.pa.gov/Individuals/Workforce-Development/warn/notices/Pages/April-2020.aspx

In April 2020, for example, I see that the final p tag (with some additional markup) contains the number of layoffs and such, while the earlier p tags contain the individual locations. Parsing those as distinct entities may make it easier to keep the fields separate.

There's a tactical question here about how to handle cases with multiple locations but a single group summary, particularly for the number of layoffs. (Perhaps write the group as one line in the CSV; or perhaps one location per row, with "GROUP: " prefixed to the parsed layoff text, then clean up the text in the transformer?)
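To make the one-location-per-row idea concrete, here is a rough sketch using only the standard library. It assumes the last p tag in a cell is the group summary and every earlier p tag is a location, which is what April 2020 looks like but may not hold on every page. The sample markup, `ParagraphCollector`, `split_notice`, and `to_rows` are all made up for illustration and are not part of the scraper.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as its own string."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "br" and self._in_p:
            # Treat a line break inside a paragraph as a space.
            self._buffer.append(" ")

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def split_notice(html):
    """Return (locations, summary), assuming the last <p> is the summary."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.paragraphs:
        return [], ""
    *locations, summary = parser.paragraphs
    return locations, summary


def to_rows(html):
    """One row per location; mark a shared summary with a "GROUP: " prefix."""
    locations, summary = split_notice(html)
    prefix = "GROUP: " if len(locations) > 1 else ""
    return [{"location": loc, "layoffs": prefix + summary} for loc in locations]


# Hypothetical markup, loosely modeled on the structure described above.
SAMPLE = """
<td>
  <p>Example Co.<br>1 Main St, Pittsburgh</p>
  <p>Example Co.<br>2 Oak Ave, Erie</p>
  <p><strong># AFFECTED:</strong> 120</p>
</td>
"""

for row in to_rows(SAMPLE):
    print(row)
```

The transformer could then strip the "GROUP: " prefix and decide whether to split or repeat the figure, so the scraper itself stays a dumb pass-through.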

stucka • Mar 10 '24 14:03

There's also a new (and probably terribly conceived) function, utils.fetch_if_not_cached, that might make sense to use for maybe all but the three newest URLs, so we're not hitting dozens of quite old files several times a day. If adapted into the existing workflow, you'd have to fetch the three newest each time and add their content to output_rows; then, for the others not in the three newest, determine the filename and URL and run utils.fetch_if_not_cached and cache.read to get their content into output_rows. But there'd be a lot less stuff in motion for repeat runs.
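Roughly, the flow I have in mind looks like the sketch below. It deliberately does not call utils.fetch_if_not_cached or cache.read, since their signatures aren't shown here; `collect_pages`, its parameters, and the injected `fetch` callable are hypothetical stand-ins, and the URL list is assumed to be ordered newest first.

```python
from pathlib import Path
from typing import Callable, List


def collect_pages(
    urls: List[str],
    cache_dir: Path,
    fetch: Callable[[str], str],
    always_refresh: int = 3,
) -> List[str]:
    """Return page contents for urls (assumed newest first).

    The first `always_refresh` URLs are downloaded on every run; older
    ones are downloaded only when no cached copy exists yet.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for index, url in enumerate(urls):
        # Hypothetical naming scheme: use the last path segment as the filename.
        path = cache_dir / url.rstrip("/").split("/")[-1]
        if index < always_refresh or not path.exists():
            path.write_text(fetch(url), encoding="utf-8")
        pages.append(path.read_text(encoding="utf-8"))
    return pages
```

With dozens of monthly pages, a repeat run would then make three requests instead of one per month, and the old cached files would not be rewritten.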

stucka • Mar 10 '24 14:03

I tweaked the file handling a little (e.g., March 2024 would never have been redownloaded, and cached files were getting rewritten) but changed nothing else. I have not looked closely at the parsing.

stucka • Mar 11 '24 00:03

Thanks! It's not ready for review yet.

chriszs • Mar 11 '24 14:03

Closing again because I haven't had a chance to work on this.

chriszs • Sep 08 '24 09:09