juriscraper SCOTUS Blocked

HTTPError: 403 Client Error: Forbidden for url: http://www.supremecourt.gov/oral_arguments/argument_audio.aspx
(1 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 385, in handle
    self.parse_and_scrape_site(mod, options["full_crawl"])
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 348, in parse_and_scrape_site
    site = mod.Site().parse()

Feb 26 '24 16:02 sentry[bot]

All of our SCOTUS scrapers appear to be blocked.

They all return 403 unless I change the user agent.

Feb 26 '24 16:02 flooie

Greaaat. Do you have an open line of comms with them?

Feb 26 '24 16:02 mlissner

I do not. I reached out via their webmaster form last week to ask about subscribing to all RSS feeds and have not heard back. I wasn't expecting to hear until today, but this seems unfortunately timed.

I'm going through my mental rolodex at the moment to figure out who to contact first.

Feb 26 '24 16:02 flooie

I started https://github.com/freelawproject/crm/issues/423 so we can keep notes about our contacts over there.

Feb 26 '24 16:02 mlissner

Two things are occurring:

Certain User Agents are blocked from the court. Juriscraper is one.
Our IP address is blocked from accessing the court website.
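
To tell the two cases apart, one option is to request the same URL with several User-Agent strings and compare the status codes. The following is only a rough sketch using the standard library: the helper names (`status_for_user_agent`, `probe_user_agents`, `classify`) are made up for illustration, and the exact User-Agent strings the court blocks are not known from this thread, so any strings passed in are placeholders.

```python
import urllib.error
import urllib.request


def status_for_user_agent(url, user_agent, timeout=30):
    """Return the HTTP status code the server gives for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return exc.code


def probe_user_agents(url, user_agents, fetch=status_for_user_agent):
    """Map each User-Agent string to the status code it receives.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    return {agent: fetch(url, agent) for agent in user_agents}


def classify(results):
    """Make a rough guess about what a block is keyed on (heuristic only)."""
    blocked = {agent for agent, status in results.items() if status == 403}
    if not blocked:
        return "no block seen"
    if len(blocked) == len(results):
        return "every agent blocked (possibly an IP-level block)"
    return "some agents blocked (likely a User-Agent block)"
```

Running `probe_user_agents` from both the production server and a local machine would help separate the two: a 403 for every agent from one host only points at the IP, while a 403 for specific agents from every host points at the User-Agent.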

Feb 26 '24 20:02 flooie

It's not just us - others are seeing similar blocks on user agents. This affects our oral argument scraper (which was expected).

Feb 27 '24 17:02 flooie

Reports are that this will be resolved by end of day today (we shall see).

Feb 29 '24 20:02 flooie

Looking at oral argument on our site, it seems like it's not yet resolved or our scraper didn't work. Our latest:

Vs. the SCOTUS website:

Mar 01 '24 18:03 mlissner

@grossir can you investigate?

Mar 01 '24 19:03 flooie

We checked and found some extra info in the ErrorLog table in the DB. Errors look like this:

'"text/html" not in ['audio/mpeg']>, <ErrorLog: 2024-02-29 20:31:04.085362+00:00 - WARNING@scotus UnexpectedContentTypeError: http://www.supremecourt.gov/media/audio/mp3files/22-704.mp3

This doesn't reproduce locally (even when using User-Agent: CourtListener), so maybe it is some kind of IP redirection/blocking? Is there any way to inspect it on the prod server?

Note that this issue wasn't caused by adding the expected_content_type checks, which happened on Feb 9. We have oral arguments as recent as Feb 21st, which makes me think it has more to do with the recent blocking from SCOTUS.
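
For context, a content-type check of this kind can be sketched roughly as below. This is not the actual CourtListener implementation: the function name `check_content_type` and the local `UnexpectedContentTypeError` class are stand-ins written for illustration, modeled only on the error text in the log above.

```python
class UnexpectedContentTypeError(Exception):
    """Stand-in for the error named in the log (hypothetical definition)."""


def check_content_type(headers, expected_types):
    """Raise if the response's media type is not one of the expected ones.

    `headers` is any mapping with a "Content-Type" key. A plain dict is
    case-sensitive, unlike the headers object that requests returns.
    """
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in expected_types:
        raise UnexpectedContentTypeError(
            f'"{media_type}" not in {expected_types}'
        )
    return media_type
```

With a check like this, a block page served as `text/html` in place of the MP3 would fail with a message similar to the one logged, which is consistent with the scraper being redirected or blocked rather than the check itself being at fault.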

Mar 01 '24 20:03 grossir

I can check on the server if you've got some commands for me to run.

Mar 01 '24 20:03 mlissner

If you can test this in a django shell on the server, this is a minimal example that mimics cl_scrape_oral_args, so we can peek into the request:

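import requests
from juriscraper.oral_args.united_states.federal_appellate.scotus import Site

# Parse the SCOTUS oral argument page and take the first item's audio URL
site = Site()
site.parse()
download_url = site[0]["download_urls"]

# Request the file with the same User-Agent and cookies, mimicking cl_scrape_oral_args
headers = {"User-Agent": "CourtListener"}
s = requests.session()
r = s.get(download_url, verify=False, headers=headers, cookies=site.cookies, timeout=300)

# A blocked/error page would show up as text/html headers and an HTML body
print(r.headers)
print(r.content[:100])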

I think the page may be redirecting to an error/blocked HTML page; if so, we should see it with this code.

Otherwise you could simply try a wget from the server and see what happens:
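
If it helps when reading the output, here is a rough sketch of a helper that summarizes a response: the redirect hops, the Content-Type header, and whether the first bytes look like MP3 or HTML. The function names are hypothetical. It assumes a requests-style response object with `history`, `status_code`, `url`, `headers` and `content` attributes, and the byte sniffing is a heuristic, not a full format check.

```python
def looks_like_mp3(first_bytes):
    """True if the bytes start with an ID3 tag or an MPEG audio frame sync."""
    if first_bytes.startswith(b"ID3"):
        return True
    # An MPEG audio frame starts with 11 set bits: 0xFF, then the top
    # three bits of the next byte.
    return (
        len(first_bytes) >= 2
        and first_bytes[0] == 0xFF
        and (first_bytes[1] & 0xE0) == 0xE0
    )


def looks_like_html(first_bytes):
    """True if the bytes start like an HTML document (ignoring whitespace)."""
    head = first_bytes.lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))


def summarize_response(response):
    """Collect the redirect hops and a guess at what the body contains."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    head = response.content[:100]
    if looks_like_mp3(head):
        body = "mp3"
    elif looks_like_html(head):
        body = "html"
    else:
        body = "unknown"
    return {
        "hops": hops,
        "content_type": response.headers.get("Content-Type"),
        "body": body,
    }
```

Calling `summarize_response(r)` on the response from the snippet above would show whether the request was redirected somewhere unexpected and whether an HTML page came back in place of the audio file.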

wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3

I get the following (the 301 comes from sending an http request; it redirects to the https version):

$ wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3

--2024-03-01 15:54:36--  http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Resolving www.supremecourt.gov (www.supremecourt.gov)... 23.50.113.144, 23.50.113.134
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.50.113.144|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3 [following]
--2024-03-01 15:54:37--  https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.50.113.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43409970 (41M) [audio/mpeg]
Saving to: ‘22-976.mp3’

22-976.mp3                              100%[=============================================================================>]  41.40M  10.0MB/s    in 4.5s    

2024-03-01 15:54:42 (9.20 MB/s) - ‘22-976.mp3’ saved [43409970/43409970]
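
As a side note, the extra 301 hop could be avoided by upgrading the URL scheme before requesting. A minimal sketch with the standard library follows; `force_https` is a hypothetical helper, not something that exists in juriscraper or CourtListener as far as this thread shows.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return the URL with an http scheme swapped for https.

    URLs with any other scheme are returned unchanged. An explicit port
    (for example :80) is kept as-is, so such URLs would need extra care.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return urlunsplit(parts._replace(scheme="https"))
```

For example, `force_https("http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3")` gives the same https URL that the `Location` header above points to.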

Mar 01 '24 20:03 grossir

Hm, the first one seems OK:

In [1]: import requests

In [2]: from juriscraper.oral_args.united_states.federal_appellate.scotus import Site

In [3]: 

In [3]: site = Site()

In [4]: site.parse()
Out[4]: <juriscraper.oral_args.united_states.federal_appellate.scotus.Site at 0x7fd716e941d0>

In [5]: download_url = site[0]['download_urls']

In [6]: headers={"User-Agent": "CourtListener"}

In [7]: s = requests.session()

In [8]: r = s.get(download_url,verify=False,headers=headers,cookies=site.cookies,timeout=300)

In [9]: print(r.headers)
{'Accept-Ranges': 'bytes', 'Content-Type': 'audio/mpeg', 'ETag': '"513708ca9a4bfc5d26024fe4d800a378:1709145623.368223"', 'Last-Modified': 'Wed, 28 Feb 2024 18:38:09 GMT', 'Server': 'AkamaiNetStorage', 'Content-Length': '43409970', 'Date': 'Fri, 01 Mar 2024 22:18:36 GMT', 'Connection': 'keep-alive', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1, ak_p; desc="1709331516777_388897933_83893725_29_7683_22_9_-";dur=1', 'content-disposition': 'attachment; filename="22-976.mp3"', 'Strict-Transport-Security': 'max-age=31536000'}

In [10]: print(r.content[:100])
b"ID3\x03\x00\x00\x00\x007(TIT2\x00\x00\x00*\x00\x00\x00(22-976) Garland, Att'y Gen. v. Cargill \x00TYER\x00\x00\x00\x06\x00\x00\x002024\x00TDAT\x00\x00\x00\x06\x00\x00\x000000\x00PRIV\x00\x00"

wget returns:

root@maintenance-ml:/opt/courtlistener# wget http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
--2024-03-01 22:19:51--  http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Resolving www.supremecourt.gov (www.supremecourt.gov)... 23.46.17.13, 23.46.17.31, 2600:1405:1800::6867:44d9, ...
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.46.17.13|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3 [following]
--2024-03-01 22:19:51--  https://www.supremecourt.gov/media/audio/mp3files/22-976.mp3
Connecting to www.supremecourt.gov (www.supremecourt.gov)|23.46.17.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43409970 (41M) [audio/mpeg]
Saving to: ‘22-976.mp3’

22-976.mp3                                                                      100%[======================================================================================================================================================================================================>]  41.40M   176MB/s    in 0.2s    

2024-03-01 22:19:52 (176 MB/s) - ‘22-976.mp3’ saved [43409970/43409970]

Mar 01 '24 22:03 mlissner

Seems OK... have we tried re-running the scraper? Maybe it runs now?

Mar 01 '24 23:03 grossir

Doesn't seem to have gotten working yet:

https://www.courtlistener.com/?type=oa&q=&type=oa&order_by=dateArgued%20desc&court=scotus

We'll have to check in on Monday, when Bill can lend a hand.

Mar 01 '24 23:03 mlissner

Hi guys, let's get this fixed, as everyone is looking for our stuff!

Mar 04 '24 14:03 flooie

Hm... not sure what was going on, but I decided to run a --fullcrawl:

www-data@maintenance-wp:/opt/courtlistener$ python manage.py cl_scrape_oral_arguments --courts juriscraper.oral_args.united_states.federal_appellate.scotus --verbosity 2 --fullcrawl
INFO Starting up the scraper.
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-976.mp3'
INFO Successfully added audio file 90881: b"Garland, Att'y Gen. v. Cargill"
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/23-3.mp3'
INFO Successfully added audio file 90882: b'Coinbase, Inc. v. Suski'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-7386.mp3'
INFO Successfully added audio file 90883: b'McIntosh v. United States'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-529.mp3'
INFO Successfully added audio file 90884: b'Cantero v. Bank of America, N.A.'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-555.mp3'
INFO Successfully added audio file 90885: b'NetChoice, LLC v. Paxton'
INFO Adding new document found at: b'http://www.supremecourt.gov/media/audio/mp3files/22-277.mp3'
INFO Successfully added audio file 90886: b'Moody v. NetChoice, LLC'
...

And that appears to be enough for now. Worth keeping an eye on it.

Mar 04 '24 15:03 flooie