MO blocking access with Incapsula/Imperva
Hey guys, I have my own tools for archiving WARN data from Missouri, and I noticed today that jobs.mo.gov has begun using Incapsula/Imperva.
So I installed warn-scraper to check whether it runs into the same problem.
It does: the CSV generated by running warn-scraper MO is blank.
Running with debug logging enabled yielded this output:
2024-01-05 14:38:35,648 - warn.utils - Requesting https://jobs.mo.gov/warn/2024
2024-01-05 14:38:35,965 - warn.utils - Response code: 200
2024-01-05 14:38:35,965 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2024.html
2024-01-05 14:38:35,965 - warn.utils - Requesting https://jobs.mo.gov/warn/2023
2024-01-05 14:38:36,159 - warn.utils - Response code: 200
2024-01-05 14:38:36,159 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2023.html
2024-01-05 14:38:36,159 - warn.utils - Requesting https://jobs.mo.gov/warn/2022
2024-01-05 14:38:36,350 - warn.utils - Response code: 200
2024-01-05 14:38:36,365 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2022.html
2024-01-05 14:38:36,365 - warn.utils - Requesting https://jobs.mo.gov/warn/2021
2024-01-05 14:38:36,520 - warn.utils - Response code: 200
2024-01-05 14:38:36,520 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2021.html
2024-01-05 14:38:36,536 - warn.utils - Requesting https://jobs.mo.gov/warn/2020
2024-01-05 14:38:36,704 - warn.utils - Response code: 200
2024-01-05 14:38:36,704 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2020.html
2024-01-05 14:38:36,704 - warn.utils - Requesting https://jobs.mo.gov/warn/2019
2024-01-05 14:38:36,867 - warn.utils - Response code: 200
2024-01-05 14:38:36,867 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2019.html
2024-01-05 14:38:36,867 - warn.scrapers.mo - 6 pages downloaded
2024-01-05 14:38:36,867 - warn.scrapers.mo - Parsing page #1
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #2
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #3
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #4
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.scrapers.mo - Parsing page #5
2024-01-05 14:38:36,888 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.scrapers.mo - Parsing page #6
2024-01-05 14:38:36,888 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.utils - Writing 0 rows to mo.csv
2024-01-05 14:38:36,888 - warn.runner - Generated mo.csv
Inspecting each of those HTML files shows that Incapsula/Imperva is preventing the script from accessing the actual page:
<html style="height:100%">
  <head>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    <meta name="format-detection" content="telephone=no">
    <meta name="viewport" content="initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
  </head>
  <body style="margin:0px;height:100%">
    <iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=14-115740487-0%200NNN%20RT%281704487115115%2090%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U18&incident_id=6520000580389742541-646795442112695950&edet=12&cinfo=04000000&rpinfo=0&cts=eQEONzyby1MvlANu2GNml3u10DCjkJcgzRC27ORflxidPXGdgOlaDqqUf3PihZHs&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 6520000580389742541-646795442112695950</iframe>
  </body>
</html>
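If anyone wants to check their own cache, here's a minimal sketch (not part of warn-scraper, just an illustration) that scans the cached HTML for that Incapsula marker. It assumes the default cache location shown in the debug log above:

```python
# Minimal sketch: scan warn-scraper's MO cache for Incapsula block pages.
# Assumes the default cache location shown in the debug log above.
from pathlib import Path

cache_dir = Path.home() / ".warn-scraper" / "cache" / "mo"
for html_file in sorted(cache_dir.glob("*.html")):
    text = html_file.read_text(encoding="utf-8", errors="ignore")
    if "_Incapsula_Resource" in text:
        print(f"{html_file.name}: Incapsula block page")
    else:
        print(f"{html_file.name}: looks like real content ({len(text):,} bytes)")
```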
For some reason, Missouri now scrapes successfully in our GitHub Actions workflow but fails for me at home, even though Incapsula/Imperva is supposed to block data-center IPs and allow residential ones.
That's at least one more data point that the scraper is working through GitHub Actions. I'm going to close this now, but please reopen it if you see a problem. I still want to build more automated QA around scrapers in general.
@Kirkman has confirmed in another venue that he continues to get blocked from two IP addresses, and I got blocked from one of the two addresses I tested. Reopening.
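For anyone who wants to test their own connection, a quick probe like this reproduces what the scraper sees. It's just a sketch; the URL comes from the debug log above, and note that the block page is served with HTTP 200, so the status code alone tells you nothing:

```python
# Quick probe: does the current IP get the real page or an Incapsula challenge?
# The block page returns HTTP 200, so check the body for the Incapsula marker.
import requests

resp = requests.get("https://jobs.mo.gov/warn/2024", timeout=30)
blocked = "_Incapsula_Resource" in resp.text
print(f"HTTP {resp.status_code}, {len(resp.text):,} bytes, blocked={blocked}")
```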
I've seen some intermittent failures, but it's generally working.
The idea of falling back on Google's cached pages as a suboptimal workaround is no longer even suboptimal, since Google's cache is no longer available. I have documentation somewhere on scraping archive snapshots from Bing. Le sigh.
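As a related illustration in the meantime (not the Bing approach above, whose endpoints I'd have to dig up), the Internet Archive's availability API can at least report whether a snapshot of the page exists:

```python
# Illustrative alternative (not the Bing cache approach mentioned above):
# query the Internet Archive's availability API for a snapshot of the page.
import requests

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "https://jobs.mo.gov/warn/2024"},
    timeout=30,
)
closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest:
    print(f"Snapshot from {closest['timestamp']}: {closest['url']}")
else:
    print("No archived snapshot found")
```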