`ri` URL has changed

Open grossir opened this issue 1 year ago • 5 comments

Sentry Issue: COURTLISTENER-7EK

HTTPError: 404 Client Error: Not Found for url: https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOrders/Forms/20232024.aspx
(2 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 387, in handle
    self.parse_and_scrape_site(mod, options)
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 350, in parse_and_scrape_site
    site = mod.Site().parse()

We need to update the scraper to the new endpoint https://www.courts.ri.gov/Pages/ood.aspx?k=(RIJCourt:%27Supreme%27)%20AND%20(ContentType:%27RIJOpinion%27)
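For reference, the `k` parameter in the new URL is just a percent-encoded search expression. Here is a minimal sketch of how that URL could be rebuilt from its parts; the function name `build_search_url` and its parameters are hypothetical, and only the page URL and the query text come from the link above.

```python
from urllib.parse import quote

# Search page taken from the new endpoint quoted in this issue.
SEARCH_PAGE = "https://www.courts.ri.gov/Pages/ood.aspx"


def build_search_url(court="Supreme", content_type="RIJOpinion"):
    """Rebuild the search URL (hypothetical helper, not the scraper's code)."""
    query = f"(RIJCourt:'{court}') AND (ContentType:'{content_type}')"
    # Leave parentheses and colons as-is, as in the link above;
    # quotes become %27 and spaces become %20.
    encoded = quote(query, safe="():")
    return f"{SEARCH_PAGE}?k={encoded}"


print(build_search_url())
```

Whether other values for the court or content type return anything useful is untested here.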

Explaining the changes

We have 2 ri scrapers, ri_u.py and ri_p.py. They request the URLs below, where the years used to change depending on the court term. From a comment in the script: "This court hears things from mid-September to end of June. This defines the "term" for that year, which triggers the website updates." The term ends on: term_end = datetime(this_year, 9, 15)

https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOpinions/Forms/20232024.aspx
https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOrders/Forms/20232024.aspx
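To make the term logic concrete, here is a rough sketch of how the year pair in those URLs can be derived from a date. The names `term_slug` and `term_url` are made up for illustration, and the exact comparison used in ri_u.py / ri_p.py may differ; only the September 15 cutoff and the URL pattern come from the scrapers and links above.

```python
from datetime import datetime

# URL pattern inferred from the two links above.
URL_TEMPLATE = (
    "https://www.courts.ri.gov/Courts/SupremeCourt/{section}/Forms/{term}.aspx"
)


def term_slug(today):
    """Return the term as 'YYYYYYYY', e.g. '20232024' (hypothetical helper)."""
    term_end = datetime(today.year, 9, 15)
    # Assumption: before Sept 15 we are still in the term that started last year.
    start_year = today.year - 1 if today < term_end else today.year
    return f"{start_year}{start_year + 1}"


def term_url(section, today):
    """Build the old-style URL for 'SupremeOpinions' or 'SupremeOrders'."""
    return URL_TEMPLATE.format(section=section, term=term_slug(today))


print(term_url("SupremeOrders", datetime(2024, 6, 13)))
```

For the date this issue was opened, this produces the 20232024 Orders URL that returned the 404 in the Sentry report.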

So, we might think that the URLs were updated before the term ended, but they exist for neither future nor past terms:

https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOpinions/Forms/20242025.aspx
https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOrders/Forms/20242025.aspx

https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOpinions/Forms/20222023.aspx
https://www.courts.ri.gov/Courts/SupremeCourt/SupremeOrders/Forms/20222023.aspx

grossir avatar Jun 13 '24 16:06 grossir

Sentry Issue: COURTLISTENER-7EK

sentry[bot] avatar Jun 13 '24 16:06 sentry[bot]

Sentry Issue: COURTLISTENER-7EJ

sentry[bot] avatar Jun 13 '24 16:06 sentry[bot]

Isn't that wonderful

flooie avatar Jul 01 '24 17:07 flooie

Ugh - I took a look at this and have a few thoughts.

  1. This should be an easy rewrite - they still expose a JSON endpoint.
  2. The U/P distinction should be merged into one scraper. The current website provides published opinions and published orders, and I think both should be collected.
  3. The miscellaneous court orders etc. can be ignored imho.
  4. The endpoint is flexible enough to allow scraping 100s of opinions at a time and is quite responsive.

The only weirdness is the schemas.microsoft.com/sharepoint XML that is required for the parameters.

flooie avatar Jul 01 '24 19:07 flooie

Partially fixed, but it appears to stop early.

flooie avatar Jul 02 '24 22:07 flooie

Closing this because the scraper is back up

flooie avatar Jul 03 '24 20:07 flooie