
Fill `illappct` gaps

Open grossir opened this issue 1 year ago • 3 comments

Part of #929

Between October 22, 2019 and May 21, 2020 we have 0 documents. We are missing around 1600 documents (152 per page, 11 full pages)

Between May 29, 2021 and November 15, 2021 we have 0 documents. We are missing around 1350 documents

To solve this, a dynamic backscraper will be implemented.

grossir avatar Mar 20 '24 17:03 grossir

Looks like we won't need to worry about the 2015/6 stuff and the HTML stuff based on your research.

flooie avatar Mar 22 '24 17:03 flooie

On illappct, most rows from 2010 and earlier have no citation string, so the current scraper cannot get the docket number. Supporting those years would require enhancing the scraper.

However, it seems we do not need to backscrape earlier years for this source, since we already have a lot of data for illappct. For example, for 2006 the source returns 5 pages, which is at most 750 records. On CL we have ~2800 opinions for this court in the same period, which is almost 4x the amount in the source...

A quick check shows that we have some duplicates:

Example 1: a, b
Example 2: a, b

Still, that doesn't explain the 4x amount...

(I am closing and reopening the PR since it is bugged again and isn't picking up the last commit I pushed)

grossir avatar Mar 23 '24 01:03 grossir

Commands to fill the gaps:

docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.illappct --backscrape-start=10/21/2019 --backscrape-end=05/20/2020

docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.illappct --backscrape-start=05/28/2021 --backscrape-end=11/16/2021
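
The two invocations differ only in their date range, so they can be generated from a list of gaps. The following is a minimal sketch: `BASE`, `GAPS`, and `build_command` are hypothetical names introduced here for illustration, and the only flags used are the ones that appear in the commands above.

```python
import shlex

# Fixed part of the command, copied from the invocations above.
BASE = [
    "docker", "exec", "-it", "cl-django",
    "python", "/opt/courtlistener/manage.py", "cl_back_scrape_opinions",
    "--courts", "juriscraper.opinions.united_states.state.illappct",
]

# (start, end) pairs in MM/DD/YYYY, as used in the commands above.
GAPS = [
    ("10/21/2019", "05/20/2020"),
    ("05/28/2021", "11/16/2021"),
]


def build_command(start, end):
    """Return the argument list for one backscrape run (hypothetical helper)."""
    return BASE + [f"--backscrape-start={start}", f"--backscrape-end={end}"]


commands = [build_command(start, end) for start, end in GAPS]

for cmd in commands:
    # shlex.join renders the argument list as a copy-pasteable shell line.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell as-is, or the argument lists can be passed to `subprocess.run` if the runs should be scripted.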

grossir avatar May 08 '24 15:05 grossir

For the full date range we now have 2899 documents, very close to the estimate of 2950 documents.
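
As a quick sanity check on these figures, the arithmetic can be reproduced as below. The numbers are the approximate counts quoted in this thread; the variable names are only illustrative.

```python
# Approximate missing-document estimates from the issue description.
missing_gap_1 = 1600  # Oct 22, 2019 to May 21, 2020
missing_gap_2 = 1350  # May 29, 2021 to Nov 15, 2021
expected = missing_gap_1 + missing_gap_2

# Documents present for the full date range after the backscrape.
scraped = 2899

shortfall = expected - scraped
coverage = scraped / expected

print(f"expected={expected} scraped={scraped} "
      f"shortfall={shortfall} coverage={coverage:.1%}")
```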

From the logs, 130 documents were skipped due to having no URL:

WARNING Opinion '2021 IL App (1st) 161797-U' has no URL. (Likely a withdrawn opinion).

4 due to having no docket:

WARNING Opinion '2021 IL (2d) 200636' has no docket.

And some cases where the URL was broken:

ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/37ca7e26-546f-4cc6-b546-66a3e3a4e72a/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/27e44736-1c31-4ede-80fa-0ca5b71d25ad/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/9c8cfe9a-e85d-44cb-bedf-bf818f13e9c6/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/cd715f77-1b51-46a8-a83e-1e6fe8c969f4/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/2dc04981-6a2a-4f6e-bac4-3ecb479bb2da/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/a53f4980-6011-4317-9cfc-3d19add9ee6f/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/88d8c0da-85da-417d-b5a0-fa437444c729/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/959f4646-289a-405e-ac2a-93af38bfec77/file
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/ea39977b-5b95-4cc4-8302-ae6b389f1bbd/file
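
Counts like the ones above can be tallied from a saved log with a few regular expressions. This is only a sketch: the patterns are inferred from the log lines quoted in this comment, real log lines may carry extra prefixes such as timestamps, and `PATTERNS`, `tally`, and `SAMPLE_LOG` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# A small sample built from the log lines quoted above.
SAMPLE_LOG = """\
WARNING Opinion '2021 IL App (1st) 161797-U' has no URL. (Likely a withdrawn opinion).
WARNING Opinion '2021 IL (2d) 200636' has no docket.
ERROR UnexpectedContentTypeError: https://www.illinoiscourts.gov/resources/37ca7e26-546f-4cc6-b546-66a3e3a4e72a/file
"""

# One pattern per skip reason; unanchored so that prefixed lines still match.
PATTERNS = {
    "no_url": re.compile(r"WARNING Opinion '.+' has no URL"),
    "no_docket": re.compile(r"WARNING Opinion '.+' has no docket"),
    "bad_content_type": re.compile(r"ERROR UnexpectedContentTypeError: "),
}


def tally(log_text):
    """Count log lines per skip reason; unmatched lines are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # each line belongs to at most one category
    return counts


counts = tally(SAMPLE_LOG)
print(dict(counts))
```

To run it against a real log, read the file contents and pass them to `tally` in place of `SAMPLE_LOG`.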

grossir avatar Jul 31 '24 16:07 grossir