warn-scraper

Fix NY: Pull in data from state's current data tables

Open chriszs opened this issue 3 years ago • 5 comments

The NY scraper is only pulling from archival data, as documented in #242. The current website lists dates and company names for some recent years in tables. Locations and job numbers would have to be pulled from semi-structured individual PDFs (at least one of which was scanned, though that one seemed decently OCRed). They mention future plans to improve this, but who knows when.

chriszs avatar Feb 20 '22 16:02 chriszs

Looking at our static historical file, it appears to run from the start of 2016 until June 30, 2021.

My proposed short-term improvement to our coverage for this state would be to write an HTML scraper for the current site that gathers everything since July 1, 2021 and merges the two files together.

palewire avatar Mar 31 '22 22:03 palewire

In #475 I added some new code to download and parse the HTML tables now available on the state website. While this is progress and will ensure that we can discover and alert on the most recent filings going forward, there are several shortcomings that still need to be addressed:

  • The code does not properly infer the current year, which will likely result in a gap when the page is updated for 2023. We'll need to loop back and fix that in the future.
  • The columns provided are clearly incomplete. There appears to be only the name of the company, the notice date and then a bureaucratic "posted date." That means we are missing two crucial fields: the number of jobs lost and the "effective date" when the losses are expected to happen. We should continue to press the state to improve what they post.
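
One way to close the year gap from the first bullet, as a rough sketch rather than what #475 does: derive the list of years from today's date instead of hard-coding it. The function name and `FIRST_TABLE_YEAR` are hypothetical, and 2021 as the first year with an HTML table is an assumption based on the July 1, 2021 start date mentioned above. How each year maps to a page on the state site is left out because that depends on the site itself.

```python
from datetime import date
from typing import List, Optional

# Assumption: the earliest year the state publishes as an HTML table.
FIRST_TABLE_YEAR = 2021


def years_to_scrape(
    today: Optional[date] = None, first_year: int = FIRST_TABLE_YEAR
) -> List[int]:
    """Return every year from first_year through the current year, inclusive."""
    # Fall back to the real current date when the caller does not pass one.
    today = today or date.today()
    return list(range(first_year, today.year + 1))
```

Passing `today` explicitly keeps this testable, so a unit test can pretend it is January 2023 and confirm that 2023 is picked up without a code change.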

palewire avatar Mar 31 '22 22:03 palewire

Alternative would be to parse the PDFs. They seem semi-structured.
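
A minimal sketch of what that could look like, assuming the text has already been extracted from a PDF and that each field sits on its own "Label: value" line. The labels below are placeholders, not the state's actual wording, so they would need to be checked against real notices.

```python
import re
from typing import Dict, Optional

# Hypothetical labels; the real PDFs may word these differently.
FIELD_LABELS = {
    "jobs": "Number Affected",
    "effective_date": "Effective Date",
    "location": "Location",
}


def parse_notice_text(
    text: str, labels: Dict[str, str] = FIELD_LABELS
) -> Dict[str, Optional[str]]:
    """Pull 'Label: value' lines out of the text of one notice."""
    parsed: Dict[str, Optional[str]] = {}
    for field, label in labels.items():
        # Match the label at the start of a line and capture the rest of that line.
        pattern = rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$"
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        # Missing fields come back as None instead of raising.
        parsed[field] = match.group(1) if match else None
    return parsed
```

Returning `None` for anything not found makes it easy to count how many notices fail to parse before deciding whether this route is worth it.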

chriszs avatar Mar 31 '22 23:03 chriszs

#477 makes good progress here, but I think it does result in some duplicates we should think through before we ship.

palewire avatar Jul 27 '22 15:07 palewire

I'd suggest just establishing a cut-off date to truncate one, keeping whichever one seems more complete in the overlap period.
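
Roughly like this, as a sketch only: it assumes both sources have been loaded as lists of dicts with a parsed date under a shared key. The function name and the `notice_date` key are made up for illustration. This version keeps the newer source in the overlap; swap the comparisons if the historical file turns out to be more complete.

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_with_cutoff(
    historical: List[Row],
    current: List[Row],
    cutoff: date,
    date_key: str = "notice_date",  # hypothetical column name
) -> List[Row]:
    """Keep historical rows dated before cutoff and current rows on or after it."""
    # Truncate the historical file at the cut-off.
    kept_old = [row for row in historical if row[date_key] < cutoff]
    # Drop anything in the newer source that predates the cut-off.
    kept_new = [row for row in current if row[date_key] >= cutoff]
    return kept_old + kept_new
```

With `cutoff=date(2021, 7, 1)`, which matches the end of the static historical file noted earlier in the thread, no row can come from both sources, so the duplicates go away without any fuzzy matching on company names.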

chriszs avatar Sep 13 '23 05:09 chriszs