
Script to automatically scrape all people

johnseekins opened this issue 2 years ago • 1 comment

The primary goal here is a simple script that scrapes all jurisdictions and creates a PR with all relevant changes to the people repo.

This PR also fixes many scraper issues:

- Some jurisdictions require a custom OpenSSL config to ensure downgrades work correctly (at least CA and FL)
- AL: fix URL escaping in names, and timeouts
- CA: Assembly wasn't getting correct address/phone number information
- CO: timeouts, and corrected the number of items in the DOM
- DC: handle empty fax numbers
- DE: timeouts
- FL: retired members weren't handled correctly; CNAME redirect removed
- GA: timeouts, retired members, and handle emails correctly
- HI: missing variable in a FormSource object
- IL: address matching was occasionally broken
- IN: timeouts, and the Democratic Senate index page has changed significantly
- KY: timeouts
- LA: empty phone numbers, compile regexes once, names had gotten mangled on the site, and timeouts
- MA: timeouts
- ME: handle missing images for Senators
- MI: certificate was broken (verify=False)
- MN: certificate for the Senate site was broken entirely (added verify=False)
- NC: handle empty emails
- ND: number of members, and empty emails
- NE: was very broken and wasn't actually collecting legislators correctly
- NH: didn't handle Independent party members
- OH: timeouts, and add the Senate
- OK: didn't handle empty seats
- PR: Senate scrape was completely broken
- SC: fix timeouts (needed 60 seconds and retries...), add Independents, and remove www from URLs (seems more stable)
- TN: remove an unneeded else block and handle broken legislators better
- WA: fix timeouts
- WI: didn't handle vacancies properly, and compile regexes once
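Many of these fixes follow the same pattern. As a rough illustration (not the actual diff), a spatula page source with a longer timeout, retries, and certificate verification disabled might look like the sketch below; the URL and selector are placeholders, and the exact `timeout`/`retries`/`verify` keyword names are assumptions about spatula's URL source rather than confirmed API:

```python
# Rough sketch of the recurring fix pattern in the spatula-based
# scrapers_next code. The timeout/retries/verify keyword names are
# assumptions; check your spatula version for the supported arguments.
from spatula import HtmlListPage, URL, XPath


class Senate(HtmlListPage):
    # Slow sites (e.g. SC) needed a 60-second timeout plus retries, and
    # sites with broken certificates (MI, MN) needed verify=False.
    source = URL(
        "https://legislature.example.gov/senators",  # placeholder URL
        timeout=60,
        retries=3,
        verify=False,
    )
    selector = XPath("//div[@class='member']")  # placeholder selector
```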

NY senate scraping is an interesting problem. The basics are there, but we can't easily collect contact information. Because of this, I didn't enable scrapers_next.ny.people.Senate in jurisdiction_configs.json. If we're okay with that compromise, it's easy to enable NY senate collection.

To make looping through all the jurisdiction people scraping easier, I added a file called jurisdiction_configs.json. This file is essentially a mapping from each jurisdiction to the actual library functions we need to call to scrape it. Because the associated script specifically looks for "people", we should be able to easily extend this to committee/vote/etc. data if needed.
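To make the mapping concrete, here is a hypothetical sketch of how a driver script could consume such a file; the entry shape shown in the comment is illustrative and may not match the actual format in this PR:

```python
# Hypothetical sketch of consuming jurisdiction_configs.json.
import importlib
import json

with open("jurisdiction_configs.json") as f:
    configs = json.load(f)

# Assumed entry shape, e.g.:
# {"ca": {"people": ["scrapers_next.ca.people.Assembly",
#                    "scrapers_next.ca.people.Senate"]}}
for jurisdiction, scrape_types in configs.items():
    for class_path in scrape_types.get("people", []):
        module_path, class_name = class_path.rsplit(".", 1)
        module = importlib.import_module(module_path)
        scraper_class = getattr(module, class_name)
        # Actual invocation (spatula's scrape machinery) is elided here.
        print(f"{jurisdiction}: would run {scraper_class.__name__}")
```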

One of the big changes is adding a custom openssl.cnf to let us handle errors like these:

```
curl -vvI https://www.assembly.ca.gov/assemblymembers
*   Trying 192.234.214.84:443...
* Connected to www.assembly.ca.gov (192.234.214.84) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/pki/tls/certs/ca-bundle.crt
*  CApath: none
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (OUT), TLS header, Unknown (21):
* TLSv1.2 (OUT), TLS alert, handshake failure (552):
* error:0A000152:SSL routines::unsafe legacy renegotiation disabled
* Closing connection 0
curl: (35) error:0A000152:SSL routines::unsafe legacy renegotiation disabled
```

Some jurisdictions (not only California) run servers with poorly configured or outdated TLS setups, such as requiring unsafe legacy renegotiation. This was less of a problem with OpenSSL 1.1, but as more systems move to OpenSSL 3.x, which disables legacy renegotiation by default, we'll see more failures like this. Adding this config lets us properly upgrade OpenSSL/etc. without breaking scraping.
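For context, the usual OpenSSL 3.x workaround for this error is a small config that re-enables legacy renegotiation, activated by pointing the OPENSSL_CONF environment variable at it when running the scrapers. The file in this PR may differ in detail, but the common form looks like:

```
# openssl.cnf: allow connections to servers that still require
# unsafe legacy TLS renegotiation (disabled by default in OpenSSL 3.x)
openssl_conf = openssl_init

[openssl_init]
ssl_conf = ssl_sect

[ssl_sect]
system_default = system_default_sect

[system_default_sect]
Options = UnsafeLegacyRenegotiation
```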

johnseekins · Jun 23 '22 15:06

The missing steps within the script itself are committing, pushing, and triggering the PR. That part is going to take some additional testing, but I wanted to get some eyes on the other changes this PR makes.
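For what it's worth, a minimal sketch of that remaining step might look like the following, assuming the GitHub CLI (`gh`) is available; the branch name, paths, and messages are placeholders:

```python
# Hypothetical sketch of the missing commit/push/PR automation, shelling
# out to git and the GitHub CLI. All names below are placeholders.
import subprocess

branch = "auto-people-scrape"  # placeholder branch name

subprocess.run(["git", "checkout", "-b", branch], check=True)
subprocess.run(["git", "add", "data/"], check=True)  # placeholder path
subprocess.run(["git", "commit", "-m", "Automated people scrape"], check=True)
subprocess.run(["git", "push", "-u", "origin", branch], check=True)
subprocess.run(
    ["gh", "pr", "create",
     "--title", "Automated people scrape",
     "--body", "Results of the automated all-jurisdiction people scrape."],
    check=True,
)
```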

johnseekins · Jun 24 '22 00:06