SCOTUS Blocked
HTTPError: 403 Client Error: Forbidden for url: http://www.supremecourt.gov/oral_arguments/argument_audio.aspx
Sentry Issue: COURTLISTENER-6RW
(1 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 385, in handle
    self.parse_and_scrape_site(mod, options["full_crawl"])
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 348, in parse_and_scrape_site
    site = mod.Site().parse()
All of our SCOTUS scrapers appear to be blocked.
They all return a 403 unless I change the User-Agent header.
Greaaat. Do you have an open line of comms with them?
I do not. I reached out via their webmaster form last week to ask about subscribing to all RSS feeds, and have not heard back. I wasn't expecting to hear until today, but it seems unfortunately timed.
I was going through my mental rolodex just now to figure out who to contact first.
I started https://github.com/freelawproject/crm/issues/423 so we can keep notes about our contacts over there.
Two things are occurring:
- Certain User Agents are blocked from the court. Juriscraper is one.
- Our IP address is blocked from accessing the court website.
It's not just us - others are seeing similar blocks on user agents. This affects our oral argument scraper (which was expected).
Reports are that this will be resolved by end of day today (we shall see).
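To check the user-agent side of this from any machine, something like the sketch below could help. It is only an illustration, not part of Juriscraper or CourtListener: `probe_user_agents` and its `open_url` parameter are hypothetical names, and the agent strings you pass in are up to you. It sends the same GET once per agent string and records the status code each one gets back.

```python
import urllib.error
import urllib.request


def probe_user_agents(url, user_agents, timeout=30, open_url=urllib.request.urlopen):
    """Return {user_agent: HTTP status code} for one GET per agent string.

    `open_url` is a hypothetical injection point (defaults to urllib's urlopen)
    so the function can be exercised without touching the network.
    """
    results = {}
    for agent in user_agents:
        request = urllib.request.Request(url, headers={"User-Agent": agent})
        try:
            with open_url(request, timeout=timeout) as response:
                results[agent] = response.status
        except urllib.error.HTTPError as error:
            # urllib raises on 4xx/5xx responses; the code is what we want to record.
            # Connection-level failures (urllib.error.URLError) are left to propagate.
            results[agent] = error.code
    return results
```

Running `probe_user_agents("http://www.supremecourt.gov/oral_arguments/argument_audio.aspx", ["Juriscraper", "Mozilla/5.0"])` from an affected machine would show which agent strings get a 403. If every agent gets a 403 from one server but not from another, that points at the IP block rather than the user agent.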
Looking at oral argument on our site, it seems like it's not yet resolved or our scraper didn't work. Our latest:
Vs. the SCOTUS website:
@grossir can you investigate?
We checked and found some extra info in the ErrorLog table in the DB. Errors look like this:
'"text/html" not in ['audio/mpeg']>, <ErrorLog: 2024-02-29 20:31:04.085362+00:00 - WARNING@scotus UnexpectedContentTypeError: http://www.supremecourt.gov/media/audio/mp3files/22-704.mp3
This doesn't reproduce locally (even when using User-Agent: CourtListener), so maybe it is some kind of IP redirection/blocking? Is there any way to inspect it on the prod server?
Note that this issue wasn't caused by adding the expected_content_type checks, which happened on Feb 9. We have oral arguments as recent as Feb 21st, which makes me think it has more to do with the recent blocking by SCOTUS.
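For anyone reading along, the kind of check that produces an error like the one above can be sketched roughly as follows. This is a hypothetical illustration, not the actual CourtListener or Juriscraper code: the function name `check_content_type`, its parameters, and the exception class defined here are assumptions made for the example.

```python
class UnexpectedContentTypeError(Exception):
    """Hypothetical stand-in for the error type seen in the ErrorLog rows."""


def check_content_type(headers, expected_content_types, url=""):
    """Raise if the response Content-Type is not one of the expected types.

    `headers` is any mapping with a "Content-Type" key, e.g. `r.headers`
    from requests (which is case-insensitive; a plain dict is not).
    """
    # Drop parameters such as "; charset=utf-8" and normalize case before comparing.
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    expected = [t.lower() for t in expected_content_types]
    if content_type not in expected:
        raise UnexpectedContentTypeError(
            f'"{content_type}" not in {expected_content_types} for {url}'
        )
    return content_type
```

With a check like this, a block page served as `text/html` in place of the MP3 fails loudly instead of being saved as audio, which is consistent with the errors appearing only once the blocking started.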
I can check on the server if you've got some commands for me to run.
If you can test this in a django shell on the server, this is a minimal example that mimics cl_scrape_oral_args. So we can peek into the request:
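import requests
from juriscraper.oral_args.united_states.federal_appellate.scotus import Site
# Parse the SCOTUS oral argument page and take the first case's audio URL.
site = Site()
site.parse()
download_url = site[0]["download_urls"]
# Send the request with the CourtListener user agent and the scraper's cookies.
headers = {"User-Agent": "CourtListener"}
s = requests.session()
r = s.get(download_url, verify=False, headers=headers, cookies=site.cookies, timeout=300)
# An HTML Content-Type or an HTML body here would point to a block/error page.
print(r.headers)
print(r.content[:100])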
I think the page may be redirecting to an error/blocked HTML page; we may see it with this code.
Otherwise you could simply try a wget from the server and see what happens:
wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
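Another quick way to tell a block/error HTML page from real audio is to look at the first bytes of whatever comes back. The sketch below is a rough heuristic with a hypothetical function name (`sniff_body`), not something that exists in the scraper. It relies on the fact that ID3v2-tagged MP3 files begin with the bytes `ID3`, and that a bare MPEG audio frame begins with an 11-bit sync word.

```python
def sniff_body(data: bytes) -> str:
    """Roughly classify a response body as "mp3", "html", or "unknown"."""
    head = data[:512]
    # ID3v2 tags, which prefix many MP3 files, start with the ASCII bytes "ID3".
    if head.startswith(b"ID3"):
        return "mp3"
    # A bare MPEG audio frame starts with a sync word: 0xFF, then the top 3 bits set.
    # This is only a heuristic; other data (e.g. a UTF-16 BOM) can match it too.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mp3"
    # Block and error pages are usually plain HTML documents.
    stripped = head.lstrip().lower()
    if stripped.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

In the django shell, `sniff_body(r.content)` on the response from the snippet above would give a one-word answer about what the server actually sent.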
I get the output below. The 301 comes from sending an http request; it redirects to the https version:
$ wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
--2024-03-01 15:54:36-- http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Resolving www.supremecourt.gov (www.supremecourt.gov)... 23.50.113.144, 23.50.113.134
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.50.113.144|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3 [following]
--2024-03-01 15:54:37-- https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.50.113.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43409970 (41M) [audio/mpeg]
Saving to: ‘22-976.mp3’
22-976.mp3 100%[=============================================================================>] 41.40M 10.0MB/s in 4.5s
2024-03-01 15:54:42 (9.20 MB/s) - ‘22-976.mp3’ saved [43409970/43409970]
Hm, the first one seems OK:
In [1]: import requests
In [2]: from juriscraper.oral_args.united_states.federal_appellate.scotus import Site
In [3]:
In [3]: site = Site()
In [4]: site.parse()
Out[4]: <juriscraper.oral_args.united_states.federal_appellate.scotus.Site at 0x7fd716e941d0>
In [5]: download_url = site[0]['download_urls']
In [6]: headers={"User-Agent": "CourtListener"}
In [7]: s = requests.session()
In [8]: r = s.get(download_url,verify=False,headers=headers,cookies=site.cookies,timeout=300)
In [9]: print(r.headers)
{'Accept-Ranges': 'bytes', 'Content-Type': 'audio/mpeg', 'ETag': '"513708ca9a4bfc5d26024fe4d800a378:1709145623.368223"', 'Last-Modified': 'Wed, 28 Feb 2024 18:38:09 GMT', 'Server': 'AkamaiNetStorage', 'Content-Length': '43409970', 'Date': 'Fri, 01 Mar 2024 22:18:36 GMT', 'Connection': 'keep-alive', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1, ak_p; desc="1709331516777_388897933_83893725_29_7683_22_9_-";dur=1', 'content-disposition': 'attachment; filename="22-976.mp3"', 'Strict-Transport-Security': 'max-age=31536000'}
In [10]: print(r.content[:100])
b"ID3\x03\x00\x00\x00\x007(TIT2\x00\x00\x00*\x00\x00\x00(22-976) Garland, Att'y Gen. v. Cargill \x00TYER\x00\x00\x00\x06\x00\x00\x002024\x00TDAT\x00\x00\x00\x06\x00\x00\x000000\x00PRIV\x00\x00"
wget returns:
root@maintenance-ml:/opt/courtlistener# wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
--2024-03-01 22:19:51-- http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Resolving www.supremecourt.gov (www.supremecourt.gov)... 23.46.17.13, 23.46.17.31, 2600:1405:1800::6867:44d9, ...
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.46.17.13|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3 [following]
--2024-03-01 22:19:51-- https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.46.17.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43409970 (41M) [audio/mpeg]
Saving to: ‘22-976.mp3’
22-976.mp3 100%[======================================================================================================================================================================================================>] 41.40M 176MB/s in 0.2s
2024-03-01 22:19:52 (176 MB/s) - ‘22-976.mp3’ saved [43409970/43409970]
Seems OK... have we tried rerunning the scraper? Maybe it runs now?
Doesn't seem to have gotten working yet:
https://www.courtlistener.com/?type=oa&q=&type=oa&order_by=dateArgued%20desc&court=scotus
We'll have to check in on Monday, when Bill can lend a hand.
Hi guys, let's get this fixed, as everyone is looking for our stuff!
Hm... not sure what was going on, but I decided to run a --fullcrawl:
www-data@maintenance-wp:/opt/courtlistener$ python manage.py cl_scrape_oral_arguments --courts juriscraper.oral_args.united_states.federal_appellate.scotus --verbosity 2 --fullcrawl
INFO Starting up the scraper.
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3'
INFO Successfully added audio file 90881: b"Garland, Att'y Gen. v. Cargill"
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/23-3.mp3'
INFO Successfully added audio file 90882: b'Coinbase, Inc. v. Suski'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-7386.mp3'
INFO Successfully added audio file 90883: b'McIntosh v. United States'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-529.mp3'
INFO Successfully added audio file 90884: b'Cantero v. Bank of America, N.A.'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-555.mp3'
INFO Successfully added audio file 90885: b'NetChoice, LLC v. Paxton'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-277.mp3'
INFO Successfully added audio file 90886: b'Moody v. NetChoice, LLC'
...
And that appears to be enough for now. It's worth keeping an eye on it.