Script to automatically scrape all people
The primary goal here is a simple script that scrapes all jurisdictions and creates a PR with all relevant changes to the people repo.
This PR also fixes many scraper issues:
- Some jurisdictions require a custom OpenSSL config to ensure downgrades work correctly (at least CA and FL)
- AL: fix URL escaping in names; fix timeouts
- CA: the Assembly scraper wasn't getting correct address/phone information
- CO: fix timeouts; correct the expected number of items in the DOM
- DC: handle empty fax numbers
- DE: fix timeouts
- FL: retired members weren't handled correctly; removed a CNAME redirect
- GA: fix timeouts, handle retired members, and handle emails correctly
- HI: add a missing variable in the `FormSource` object
- IL: address matching was occasionally broken
- IN: fix timeouts; the Democratic Senate index page has changed significantly
- KY: fix timeouts
- LA: handle empty phone numbers, compile regexes once, fix names that had gotten mangled on the site, and fix timeouts
- MA: fix timeouts
- ME: handle missing images for Senators
- MI: the certificate was broken (`verify=False`)
- MN: the certificate for the Senate site was broken entirely (added `verify=False`)
- NC: handle empty emails
- ND: fix the expected number of members; handle empty emails
- NE: was very broken and wasn't actually collecting legislators correctly
- NH: didn't handle Independent party members
- OH: fix timeouts; add the Senate
- OK: didn't handle empty seats
- PR: the Senate scrape was completely broken
- SC: fix timeouts (needed 60 seconds and retries; see the sketch after this list), add Independents, and remove www from URLs (seems more stable)
- TN: remove an unneeded `else` block; handle broken legislators better
- WA: fix timeouts
- WI: didn't handle vacancies properly; compile regexes once
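For the timeout-heavy jurisdictions above, the fix generally boils down to a longer request timeout plus retries. As a rough illustration of the pattern, using plain `requests`/`urllib3` rather than our actual scraper plumbing, and with a placeholder URL:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient server failures with backoff; SC in particular needed
# a 60-second timeout plus retries before it would respond reliably.
retries = Retry(total=3, backoff_factor=2, status_forcelist=[500, 502, 503, 504])

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

# Placeholder URL -- each real scraper hits its own state site.
response = session.get("https://legislature.example.gov/members", timeout=60)
response.raise_for_status()
```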
NY Senate scraping is an interesting problem. The basics are there, but we can't easily collect contact information. Because of this, I didn't enable `scrapers_next.ny.people.Senate` in `jurisdiction_configs.json`. If we're okay with that compromise, it's easy to enable NY Senate collection.
To make looping through all the jurisdiction people scrapes easier, I added a file called `jurisdiction_configs.json`. This file is essentially a mapping from each jurisdiction to the library functions we need to call to scrape it. Because the associated script specifically looks for "people" scrapers, we should be able to easily extend this to committee/vote/etc. data if needed.
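I won't pin down the exact schema here, but conceptually each entry maps a jurisdiction to the scraper classes to run. A hypothetical sketch (the field names and the non-NY class paths are illustrative, not the actual file contents):

```json
{
  "ny": {
    "people": ["scrapers_next.ny.people.Assembly"]
  },
  "sc": {
    "people": ["scrapers_next.sc.people.House", "scrapers_next.sc.people.Senate"]
  }
}
```

The loop in the script could then resolve those dotted paths along these lines (again, a sketch rather than the exact implementation):

```python
import importlib
import json

with open("jurisdiction_configs.json") as f:
    configs = json.load(f)

for jurisdiction, scrapers in configs.items():
    # "people" is the only key the script cares about today; committee/vote
    # scrapers could slot in as sibling keys later.
    for dotted_path in scrapers["people"]:
        module_name, class_name = dotted_path.rsplit(".", 1)
        scraper_cls = getattr(importlib.import_module(module_name), class_name)
        # ... instantiate scraper_cls and collect its output ...
```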
One of the big changes is adding a custom `openssl.cnf` to let us handle errors like these:
```
curl -vvI https://www.assembly.ca.gov/assemblymembers
* Trying 192.234.214.84:443...
* Connected to www.assembly.ca.gov (192.234.214.84) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
* CApath: none
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (OUT), TLS header, Unknown (21):
* TLSv1.2 (OUT), TLS alert, handshake failure (552):
* error:0A000152:SSL routines::unsafe legacy renegotiation disabled
* Closing connection 0
curl: (35) error:0A000152:SSL routines::unsafe legacy renegotiation disabled
```
Some jurisdictions (not only California) have poorly configured or old TLS certs. This was less of a problem with OpenSSL 1.1, but as more systems move to OpenSSL 3.x, we'll see more problems here. Adding this config lets us properly upgrade OpenSSL/etc. without breaking scraping.
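For reference, the usual OpenSSL 3.x fix for this specific error is enabling the `UnsafeLegacyRenegotiation` option via a config file. A minimal sketch along those lines (the actual `openssl.cnf` in this PR may differ):

```ini
openssl_conf = openssl_init

[openssl_init]
ssl_conf = ssl_sect

[ssl_sect]
system_default = system_default_sect

[system_default_sect]
# Re-enable legacy renegotiation for servers that still require it.
# If a site's cert chain is too weak, CipherString = DEFAULT@SECLEVEL=1
# can also be set here.
Options = UnsafeLegacyRenegotiation
```

Pointing the `OPENSSL_CONF` environment variable at this file (e.g. `OPENSSL_CONF=./openssl.cnf` before running the scrape) applies it to that process without touching the system-wide config.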
The missing steps within the script itself are committing, pushing, and triggering the PR. That part is going to take some additional testing, but I wanted to get some eyes on the other changes this PR makes.
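The remaining plumbing is conceptually small; a hypothetical sketch using the GitHub CLI (branch names, paths, and messages here are placeholders, not necessarily what the script will end up doing):

```python
import subprocess

# Commit the regenerated people files on a fresh branch and push it.
subprocess.run(["git", "checkout", "-b", "auto-people-update"], check=True)
subprocess.run(["git", "add", "."], check=True)
subprocess.run(["git", "commit", "-m", "Automated people scrape"], check=True)
subprocess.run(["git", "push", "-u", "origin", "auto-people-update"], check=True)

# Open the PR via the GitHub CLI; the REST API (e.g. PyGithub) would also work.
subprocess.run(
    ["gh", "pr", "create",
     "--title", "Automated people scrape",
     "--body", "Automated update from the people scraping script."],
    check=True,
)
```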