
MO blocking access with Incapsula/Imperva

Open Kirkman opened this issue 1 year ago • 4 comments

Hey guys, I have my own tools for archiving WARN data from Missouri, and noticed today that jobs.mo.gov has begun using Incapsula/Imperva.

So I tried installing and using warn-scraper, to check if it had the same problem.

It does. The CSV generated by running warn-scraper MO is blank.

I tried with debug logging enabled, which yielded this output:

2024-01-05 14:38:35,648 - warn.utils - Requesting https://jobs.mo.gov/warn/2024
2024-01-05 14:38:35,965 - warn.utils - Response code: 200
2024-01-05 14:38:35,965 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2024.html
2024-01-05 14:38:35,965 - warn.utils - Requesting https://jobs.mo.gov/warn/2023
2024-01-05 14:38:36,159 - warn.utils - Response code: 200
2024-01-05 14:38:36,159 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2023.html
2024-01-05 14:38:36,159 - warn.utils - Requesting https://jobs.mo.gov/warn/2022
2024-01-05 14:38:36,350 - warn.utils - Response code: 200
2024-01-05 14:38:36,365 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2022.html
2024-01-05 14:38:36,365 - warn.utils - Requesting https://jobs.mo.gov/warn/2021
2024-01-05 14:38:36,520 - warn.utils - Response code: 200
2024-01-05 14:38:36,520 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2021.html
2024-01-05 14:38:36,536 - warn.utils - Requesting https://jobs.mo.gov/warn/2020
2024-01-05 14:38:36,704 - warn.utils - Response code: 200
2024-01-05 14:38:36,704 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2020.html
2024-01-05 14:38:36,704 - warn.utils - Requesting https://jobs.mo.gov/warn/2019
2024-01-05 14:38:36,867 - warn.utils - Response code: 200
2024-01-05 14:38:36,867 - warn.cache - Writing to cache C:\BLAHBLAH\.warn-scraper\cache\mo\2019.html
2024-01-05 14:38:36,867 - warn.scrapers.mo - 6 pages downloaded
2024-01-05 14:38:36,867 - warn.scrapers.mo - Parsing page #1
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #2
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #3
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,882 - warn.scrapers.mo - Parsing page #4
2024-01-05 14:38:36,882 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.scrapers.mo - Parsing page #5
2024-01-05 14:38:36,888 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.scrapers.mo - Parsing page #6
2024-01-05 14:38:36,888 - warn.scrapers.mo - No tables found
2024-01-05 14:38:36,888 - warn.utils - Writing 0 rows to mo.csv
2024-01-05 14:38:36,888 - warn.runner - Generated mo.csv

Inspecting each of those HTML files shows that Incapsula/Imperva is preventing the script from accessing the actual page:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=14-115740487-0%200NNN%20RT%281704487115115%2090%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U18&incident_id=6520000580389742541-646795442112695950&edet=12&cinfo=04000000&rpinfo=0&cts=eQEONzyby1MvlANu2GNml3u10DCjkJcgzRC27ORflxidPXGdgOlaDqqUf3PihZHs&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 6520000580389742541-646795442112695950</iframe></body></html>
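
Because the server answers 200 even when the request is blocked, the status code alone can't flag this. Here is a minimal sketch of a check a scraper could run on each downloaded page. The function names are hypothetical and not part of warn-scraper, and treating these two strings as a reliable signature is an assumption based only on the block page quoted above:

```python
import re
from typing import Optional

# Markers copied from the block page quoted above. Treating them as a stable
# signature is an assumption, not documented Incapsula/Imperva behavior.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")

INCIDENT_RE = re.compile(r"Incapsula incident ID: ([\d-]+)")


def looks_blocked(html: str) -> bool:
    """Return True if the HTML resembles the Incapsula block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


def incident_id(html: str) -> Optional[str]:
    """Return the incident ID from a block page, or None if absent."""
    match = INCIDENT_RE.search(html)
    return match.group(1) if match else None
```

Running `looks_blocked` on each cached file before parsing would let the scraper fail loudly instead of quietly writing 0 rows.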

Kirkman, Jan 05 '24 20:01

For some reason, Missouri is now scraping successfully in our GitHub Actions workflow but not for me at home, even though Incapsula/Imperva is supposed to block data centers and allow residential addresses.

stucka, Jan 16 '24 23:01

We have at least one more data point that the scraper is working through GitHub Actions. I'm going to close this now, but please reopen it if you see a problem. I still want to build more automated QA around scrapers in general.
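
As one possible shape for that kind of QA, here is a rough sketch that fails when a scrape produces a CSV with no data rows, which is the symptom reported above. The function names and the `minimum_rows` threshold are hypothetical, and nothing here is existing warn-scraper code:

```python
import csv
import io


def count_data_rows(csv_text: str) -> int:
    """Count non-blank rows after the header in CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip the header row and ignore rows whose cells are all empty.
    return sum(1 for row in rows[1:] if any(cell.strip() for cell in row))


def check_scrape(csv_text: str, minimum_rows: int = 1) -> None:
    """Raise ValueError if the scrape produced fewer rows than expected."""
    found = count_data_rows(csv_text)
    if found < minimum_rows:
        raise ValueError(
            f"expected at least {minimum_rows} data rows, found {found}"
        )
```

A check like this could run after each scraper in the workflow, so an empty mo.csv would fail the job instead of passing silently.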

stucka, Jan 31 '24 14:01

@Kirkman has confirmed in another venue that he continues to get blocked from two IP addresses, and I got blocked from one of the two addresses I tested. Reopening.

stucka, Jan 31 '24 15:01

I've seen some intermittent failures, but it's generally working.

Falling back on Google's cached copies was only ever a suboptimal workaround, and now it isn't even that, because Google's cache is no longer available. I have documentation somewhere on scraping archive snapshots from Bing. Le sigh.

stucka, Mar 10 '24 13:03