Fill `sd` gaps
Part of #929
sd
We have a gap from August 30th, 2019 to January 4th, 2023, which should amount to a little over 240 missing documents (an estimate, since totals are not easy to count here).
The command to fill this specific gap (once the backscraper PR is merged on CL):
`docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.sd --backscrape-start=2019 --backscrape-end=2022`
Just checked the counts: we have filled the gaps. 73 for 2020, 69 for 2021, 81 for 2022, and 142 for 2019 (should be 67).
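As a quick sanity check on the numbers above, here is a minimal sketch that totals the reported counts and computes the surplus for 2019. The variable and function names are hypothetical, and only 2019 has an expected count stated in this thread, so the other years are not compared.

```python
# Per-year opinion counts for `sd`, as reported in this comment.
observed = {2019: 142, 2020: 73, 2021: 69, 2022: 81}

# Expected counts; only 2019 is stated in this thread (assumption: others unknown).
expected = {2019: 67}


def surplus_by_year(observed, expected):
    """Return observed minus expected for each year that has an expected count."""
    return {
        year: observed[year] - want
        for year, want in expected.items()
        if year in observed
    }


total = sum(observed.values())
surplus = surplus_by_year(observed, expected)

print(f"total documents counted: {total}")
print(f"surplus by year: {surplus}")
```

The surplus for 2019 is larger than the expected count itself, so it may not be explained by a single duplicate per opinion.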
However, we have some new data problems:
There are duplicates from a merger, tagged with the vLex icon. This is related to freelawproject/courtlistener#3803. For example:
Apart from the duplication, a number of the backscraped opinions did not have their text extracted properly. This can be seen in both of the scraped examples above.
While counting, I noticed that the same query returns very different counts when logged in versus logged out. Could this be a problem for casual users / search engines?
Logged in:
Logged out: