Rossi
Related to #883 - add jsonschema dependencies - create JSON Schemas for each scraped object, corresponding to courtlistener's Django Models - validate scraped data using JSONSchemaValidator - support nested objects...
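As a rough illustration of the validation step described above, here is a minimal sketch using the `jsonschema` package. The schema, its field names (`case_name`, `date_filed`, `docket`, `docket_number`) and the helper `validation_errors` are hypothetical and are not taken from the PR; the real schemas mirror courtlistener's Django models.

```python
from jsonschema import Draft7Validator

# Hypothetical schema for one scraped object; field names are assumptions,
# not the actual schemas added by the PR.
OPINION_CLUSTER_SCHEMA = {
    "type": "object",
    "properties": {
        "case_name": {"type": "string", "minLength": 1},
        # ISO date expressed as a pattern, since "format" is not enforced by default
        "date_filed": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        # Nested object, validated by the same pass
        "docket": {
            "type": "object",
            "properties": {"docket_number": {"type": "string"}},
            "required": ["docket_number"],
        },
    },
    "required": ["case_name", "date_filed"],
}


def validation_errors(scraped):
    """Return a sorted list of 'path: message' strings, empty if valid."""
    validator = Draft7Validator(OPINION_CLUSTER_SCHEMA)
    errors = sorted(
        validator.iter_errors(scraped),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    return [
        f"{'/'.join(str(p) for p in e.absolute_path) or '<root>'}: {e.message}"
        for e in errors
    ]


if __name__ == "__main__":
    bad = {"case_name": "A v. B", "date_filed": "July 29, 2022", "docket": {}}
    for line in validation_errors(bad):
        print(line)
```

Collecting every error with `iter_errors` instead of stopping at the first one makes it easier to see all the problems in a scraped item at once, including those in nested objects.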
Assumes backscrape keyword arguments from the dynamic backscraping PR. Solves #944
Related to #929. Between May 05, 2020 and August 20, 2020 we have [2 documents](https://www.courtlistener.com/?q=court_id%3Atex&type=o&order_by=dateFiled%20asc&stat_Precedential=on&stat_Non-Precedential=on&filed_after=05/05/2020&filed_before=08/20/2020). We are missing 47 documents. `tex` will need to be updated to handle backscrapes, and...
This is part of #929. Missing around 100 documents. Between April 27, 2018 and February 13, 2019 we have [1 document](https://www.courtlistener.com/?q=court_id%3Any&type=o&order_by=dateFiled%20asc&stat_Precedential=on&stat_Non-Precedential=on&filed_after=04%2F27%2F2018&filed_before=02%2F13%2F2019). We are missing [92 documents](https://iapps.courts.state.ny.us/lawReporting/CourtOfAppealsSearch?searchType=opinion). Between June 16, 2023...
The scraper is picking the HTML "detail" page link instead of the PDF link. The XPath should be updated. We have a bunch of HTML pages and duplicates on CL...
I think there are 3 big classes of gaps: - **0 gap**: when we have 0 documents for a time period, and have a regular count before and after. We...
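The "0 gap" class above could be detected along these lines. This is a minimal sketch that assumes document counts are already aggregated per period and that a "regular count" simply means a nonzero count; the function name `find_zero_gaps` and the sample data are hypothetical.

```python
def find_zero_gaps(counts):
    """Find runs of zero-count periods bounded by nonzero counts on both sides.

    `counts` is a chronologically ordered list of (period_label, document_count).
    Returns a list of (first_empty_label, last_empty_label) tuples.
    """
    gaps = []
    run_start = None  # index where the current run of zeros began
    for i, (_label, n) in enumerate(counts):
        if n == 0:
            if run_start is None:
                run_start = i
        else:
            # A run only counts as a gap if something precedes it (run_start > 0);
            # this branch guarantees something follows it.
            if run_start is not None and run_start > 0:
                gaps.append((counts[run_start][0], counts[i - 1][0]))
            run_start = None
    return gaps


if __name__ == "__main__":
    monthly = [
        ("2018-03", 12),
        ("2018-04", 0),
        ("2018-05", 0),
        ("2018-06", 9),
        ("2018-07", 0),  # trailing zeros: no count after, so not reported
    ]
    print(find_zero_gaps(monthly))
```

Leading and trailing runs of zeros are deliberately ignored, since they lack a count on one side and may just mark the limits of the scraped range.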
The scraper is skipping all citable opinions. It skips rows which have no PDF links in the first column. Coincidentally, all the opinions with a citation string have no such...
There is more data available in the HTML we already request that we don't parse (the scraping class is in `tex.py`). This is an instance of #889. At least these...
The current method to get data from a secondary page is to use a `DeferringList`. This method is designed for parsing a single field. However, we may want to get...
We had an error on courtlistener when extracting date_filed using `extract_from_text` from the recently added bap1: ``` { "OpinionCluster": {"date_filed": "July 29, 2022"}, }, ... File "/opt/courtlistener/cl/scrapers/tasks.py", line 179, in extract_doc_content...