
Get citations using scrapers

Open flooie opened this issue 2 years ago • 17 comments

I'm going to workshop my thoughts on prioritization here - and welcome feedback and thoughts.

flooie avatar Jan 12 '24 18:01 flooie

@grossir can you please add your suggestion for using back scrapers to collect citations or other material posted later.

flooie avatar Jul 19 '24 16:07 flooie

Sure @flooie

This would work for sources that:

  1. Have a "citation" column on their HTML pages
  2. Leave it as a placeholder for some time, until the court populates it

An example is md: compare these 2 images from 2023 and 2024 (the current year), where citations are not populated yet

image image

The approach is to run the backscraper with a custom caller. Here is some pseudocode

from juriscraper.opinions.united_states.state import md as scraper_module
from juriscraper.lib.importer import site_yielder
from cl.search.models import Opinion, OpinionCluster
from cl.scrapers.management.commands.cl_scrape_opinions import make_citation
import logging


logger = logging.getLogger(__name__)

class CitationCollector:
    def scrape_citations(self, start_date, end_date):
        for site in site_yielder(
            scraper_module.Site(
                backscrape_start=start_date,
                backscrape_end=end_date,
            ).back_scrape_iterable,
            scraper_module,
        ):
            # get case dicts by parsing HTML
            site.parse()
            
            # the court id is the last part of the dotted module path, e.g. "md"
            court_id = scraper_module.__name__.split(".")[-1].split("_")[0]
            
            for record in site:
                citation = record['citations']
                if not citation:
                    continue
                
                # get cluster using download_url or hash of the document
                cluster = Opinion.objects.get(download_url=record['download_urls']).cluster
                
                # check if citation exists
                if self.citation_exists(citation, cluster):
                    logger.info("Citation already exists '%s' for cluster %s", record['citations'], cluster.id)
                    continue
                    
                citation = make_citation(citation, cluster, court_id)
                citation.save()
                
    def citation_exists(self, citation, cluster):
        """To implement"""
        return False
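
One way the `citation_exists` placeholder could be filled in is to compare the parsed volume, reporter and page against the citations already attached to the cluster. The following is a minimal sketch, assuming citations look like `488 Md. 537`. `parse_citation` and the `existing` set are hypothetical stand-ins for the real database lookup, not CourtListener code:

```python
import re

# Hypothetical helper: splits "488 Md. 537" into (volume, reporter, page).
CITE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*$")


def parse_citation(cite_str):
    """Return (volume, reporter, page) or None if the string does not parse."""
    match = CITE_RE.match(cite_str)
    if not match:
        return None
    return match.groups()


def citation_exists(cite_str, existing):
    """Return True if the parsed citation is already in `existing`.

    `existing` stands in for the cluster's stored citations, given as a
    set of (volume, reporter, page) tuples.
    """
    parsed = parse_citation(cite_str)
    return parsed is not None and parsed in existing


existing = {("488", "Md.", "534")}
print(citation_exists("488 Md. 534", existing))
print(citation_exists("488 Md. 537", existing))
```

In the real command the set would presumably come from a query on the cluster's citations; that part is left out here because it depends on the CourtListener models.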
    

grossir avatar Jul 19 '24 20:07 grossir

Simple enough. Is it a good idea to analyze this across all states to figure out:

  • Which have it
  • How delayed each is
  • How far back each goes
  • How difficult each is to scrape
  • ?

Thank you guys.

Also, should we spin this off into its own ticket and task? My hope was to use this issue to discuss high level architecture of a new Juriscraper system, not features we want to add.

mlissner avatar Jul 22 '24 14:07 mlissner

I have a spreadsheet that looked at each state - and where these citations could be pulled from. In many cases the citations appear later on the scrapers and in others there is a second cite that could be scraped. The two probably are lexis or west cites that could be scraped (maybe).

https://docs.google.com/spreadsheets/d/1zYP_4ivL2XQF8mlrgdTmzXB57sTn6UYv8GrrRkq7X5Q/edit?usp=sharing

STATE CITES COUNT
YES 27
PROBABLE 2
UNCLEAR 6
NO 16

10 with neutral citations

flooie avatar Jul 23 '24 13:07 flooie

That's not too bad! Let's keep filling this in with info about how far back each goes, and things like that.

mlissner avatar Jul 23 '24 14:07 mlissner

Yes, but I think many of these links are unrelated to the current scrapers, so it's more of a jumping off point for this.

flooie avatar Jul 23 '24 14:07 flooie

A draft list that answers the questions below, organized by "How difficult each is to scrape", in the sense of whether we already have the scraper implemented:

Which have it? How delayed is each? How difficult is each to scrape?

I haven't checked how far back each goes.

Sources that publish citations in the same URL we scrape

In other words, we just need to run (or implement) the backscraper with a custom caller

| Source | Time lag until citation is published | Example |
|---|---|---|
| md | 1 year | See above, most recent citation is from August 2023 |
| scotus_slip | 1 month | Most recent citation is 602 U.S. 406 for 22-976 Garland v. Cargill published on June 14, 2024 |
| colo | 3 months | Earliest non neutral citation I could find is 545 P.3d 942 for a decision from 15 April 2024, which we do not have in the CL |
| minn | 3 months | Earliest citation I could find is 5 N.W.3d 680 for an opinion from May 1st 2024 (today is August 12th) |
| ohio | 6 months | Earliest citation I could find is 175 Ohio St.3d 155 for an opinion from April 4th 2024 (today is Sep 10th) |
| texapp | ?? | The citations for texapp are available in the new tex source we are scraping |
| haw | 3 months | "156 Haw. 144" From June 30, 2025 (it is September 8th, 2025 at time of writing). Currently we only have up to volume 140 for that reporter |
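
The lag estimates can be sanity-checked with simple date arithmetic. This is a small sketch using the dates quoted in the table; the year of the "today" dates for minn and ohio is assumed to be 2024, and the result is only an upper bound, since the citation may have been published some time before it was noticed:

```python
from datetime import date

# (source, decision date, date the source was checked), taken from the table.
# Assumption: the check dates for minn and ohio are in 2024.
observations = [
    ("minn", date(2024, 5, 1), date(2024, 8, 12)),
    ("ohio", date(2024, 4, 4), date(2024, 9, 10)),
    ("haw", date(2025, 6, 30), date(2025, 9, 8)),
]


def lag_in_days(decided, checked):
    """Days between the decision and the day its citation was seen."""
    return (checked - decided).days


for source, decided, checked in observations:
    days = lag_in_days(decided, checked)
    print(f"{source}: {days} days (~{days / 30:.1f} months)")
```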

Sources that have a neutral citation inside the opinion's document, but we didn't extract it

To collect past neutral citations, we would need to run the recently updated scraper with extract_from_text against older Opinion.plain_text already in the DB

| Source | citation extractor implemented in | Status | Date since collected | Citations added |
|---|---|---|---|---|
| vt | PR | Done | 2017-01-01 | 658 |
| wis | Sep 3rd, 2024. PR | Done | 2020-01-01 | 386 |
| wisctapp | ... | Done | 2020-01-01 | 0 |
| pasuperct | PR | Pending | ? | ? |
| or and orctapp | TBD | Done | ? | 2417 for orctapp, 171 for or |
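
To illustrate what an `extract_from_text` implementation for these sources might look like, here is a minimal sketch for a neutral citation of the assumed form `<year> VT <number>`. The regex, the function body and the sample text are illustrative assumptions, not the code from the juriscraper vt scraper; only the `{"Citation": {...}}` return shape follows what the scripts in this thread consume:

```python
import re

# Assumed shape of a Vermont neutral citation: "<year> VT <number>".
NEUTRAL_CITE_RE = re.compile(r"\b(?P<volume>\d{4})\s+VT\s+(?P<page>\d+)\b")


def extract_from_text(scraped_text):
    """Hypothetical stand-in for a scraper's extract_from_text method.

    Returns {"Citation": {...}} when a neutral citation is found in the
    text, and an empty dict otherwise.
    """
    match = NEUTRAL_CITE_RE.search(scraped_text or "")
    if not match:
        return {}
    return {
        "Citation": {
            "volume": match.group("volume"),
            "reporter": "VT",
            "page": match.group("page"),
        }
    }


# Made-up sample text, only to show the return shape.
print(extract_from_text("ENTRY ORDER 2024 VT 12 SUPREME COURT DOCKET"))
print(extract_from_text("An order without a neutral citation"))
```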

Sources that publish an updated document version with the citation

| Source | citation extractor implemented in | Status | Lag until document update | Last citation from reporter in CL |
|---|---|---|---|---|
| ga | Pending | Pending | End of year? In March 2025, documents from late December 2024 have a "FINAL COPY" version | June 28th, 2019, 306 Ga. 351 |
| nm | Done | Versioning problem | End of year? As of April 2 2025, the latest citation in the source is 2025-NMSC-009 - 12/06/2024 | 2021 NMSC 008 |
| neb | Pending | Pending | 5 years or more. As of Aug 2025, most recent citation is from 01/21/2020. Documents will be tagged as "Certified" instead of "Advance" when they contain a regional citation. | See with missing citation 938 N.W.2d 378 |

Sources that need a backscraper for a different URL than we scrape

In other words, the backscraper may need to go into the united_states_backscrapers, if not a different category folder

| Source | Time lag until citation is published | Example | Modification required |
|---|---|---|---|
| okla | 2 months | Most recent citation 549 P.3d 1260 for case published in 05/21/2024, KNOX v. OKLAHOMA GAS AND ELECTRIC CO. We don't have the citation in CL | We just changed the target URL, but we have code in the Git history to scrape and parse the site where citations are published. |
| conn | 1.5 months | Most recent citation 349 Conn. 417 for case published in 06/25/2024 | We would have to scrape a different page, and extract the data from PDFs, but they are nicely separated, 1 link per each opinion back to volume 326 from 2017. Before, back to volume 320, is a single PDF for all opinions |

grossir avatar Jul 24 '24 20:07 grossir

Thanks @grossir. Should we rename this issue to be about capturing citations, and make a new one to talk about Juriscraper 3.0 architecture?

mlissner avatar Jul 25 '24 15:07 mlissner

Happy to report that the citation backscraper is working, just ran it in prod on md and will soon run it with scotus_slip.

image

Added 305 citations by running

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2019 --backscrape-end=2023 --verbosity 3

Also added 89 opinions, some of which may be opinions we already had, for which the hash has changed due to corrections

grossir avatar Aug 22 '24 16:08 grossir

We ran this for scotus_slip, only term 22, and duplicated all records from that term. If the duplications are not too big of a problem, we could run it for all of scotus_slip and get all the citations that we are missing

Anyway, it would be very nice to address the duplication problem https://github.com/freelawproject/courtlistener/issues/3803

The command:

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.federal_appellate.scotus_slip --backscrape-start=2023/01/01 --backscrape-end=2023/06/01 --verbosity 3

grossir avatar Aug 23 '24 21:08 grossir

Yikes, those duplicates aren't great, no. Let's clean that up somehow, and figure out how to avoid dups before we have 20M opinions. :)

mlissner avatar Aug 23 '24 22:08 mlissner

For sources where the citations are inside the document's text, but we just recently implemented extract_from_text to get them, we can run a script like the following (currently, we can do this over vt, wis and wisctapp)

from juriscraper.opinions.united_states.state.vt import Site
from cl.search.models import OpinionCluster, Citation
from django.db import transaction
import traceback

"""
Tested with the following clusters:

Already has a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4335586

Recent document, Doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 10099996

Is an order, doesn't have a neutral citation
python manage.py clone_from_cl --type search.OpinionCluster --id 10044928

Old document (2017), doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4489376
"""


site = Site()
# according to the citations search page, 
# latest VT neutral citations we have are from 2015
# https://www.courtlistener.com/c/vt/

# However, we can find neutral citations from 2017?
# https://www.courtlistener.com/opinion/4335586/representative-donald-turner-jr-and-senator-joseph-benning-v-governor/

query = """
SELECT *
FROM search_opinioncluster
WHERE 
        docket_id IN (SELECT id FROM search_docket sd WHERE court_id = 'vt') 
    AND
        id NOT IN (
            SELECT cluster_id 
            FROM search_citation
            WHERE reporter = 'VT'
        ) 
    AND
        precedential_status = 'Published'
    AND
        date_filed > '2018-01-01'::date
"""
# This query selects all published 'vt' opinion clusters filed after 2018-01-01
# which do not have a "VT" reporter neutral citation
# It queries over indexes

success, failure, iterated = 0, 0, 0
queryset = OpinionCluster.objects.raw(query).prefetch_related('sub_opinions')
for cluster in queryset:
    iterated += 1
    
    for opinion in cluster.sub_opinions.all():
        metadata = site.extract_from_text(opinion.plain_text)
        if not metadata:
            continue

        citation_kwargs = metadata['Citation']
        # attach the extracted citation to the cluster being iterated
        citation_kwargs['cluster_id'] = cluster.id
        
        try:
            with transaction.atomic():
                Citation.objects.create(**citation_kwargs)
                print(f"Created citation {citation_kwargs}")
            success += 1
        except Exception:
            print(f"Failed creating citation for {citation_kwargs}")
            print(traceback.format_exc())
            failure += 1
        
print(f"Created {success}\nFailed {failure}\nIterated {iterated}")

grossir avatar Sep 10 '24 03:09 grossir

I've noticed two citation gaps in Ohio, both documented in courtlistener issue #3882.

  1. Missing neutral citations in unpublished cases. I think this perhaps happened because Ohio added neutral citations at some point in time that may have been after we scraped. It's possible other states have done this. I haven't systematically tested the extent of this, but I think there are lots of these.
  2. Missing neutral citations in published cases. Again, some webcites have been added retroactively, so if we got print cases from Harvard (especially 1990s and early 2000s), we may not have the neutral citation parallel cite.

Both of these issues have increased urgency because, as I note in that issue, Ohio Supreme Court has changed style rules to only require neutral citations when they are available, so we're going to start to see a lot of new published opinions that only refer to prior cases by neutral citation.

rlfordon avatar Jan 10 '25 17:01 rlfordon

Just ran the command to get md lagged citations. Got

  • 56 citations added to an existing cluster
  • 14 citations added to a new cluster, meaning we got 2 versions to merge for a single opinion
  • deleted 7 hash duplicates that were causing the command to error; I did it using the admin and making sure no undesired cascades happened

./manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2024 --backscrape-end=2024 --verbosity 3

INFO Starting up the scraper.
INFO Using court_str: "md"
INFO Now downloading case page at: https://www.mdcourts.gov/cgi-bin/indexlist.pl?court=coa&year=2024&order=bydate&submit=Submit
INFO juriscraper.opinions.united_states.state.md: Successfully found 92 items.
DEBUG No citation, skipping row for case Willey v. Brown
DEBUG No citation, skipping row for case Hollins v. State
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Franklin
DEBUG No citation, skipping row for case Scott v. Hon. Bowman
DEBUG No citation, skipping row for case Reinstatement of Kirwan
DEBUG No citation, skipping row for case Reinstatement of Assaraf
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Loots
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Gormley
DEBUG No citation, skipping row for case Feng v. Chen
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Elan
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Yeatman
DEBUG No citation, skipping row for case State Bd. of Elections v. Ambridge
DEBUG No citation, skipping row for case State v. Scarboro
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Mahoney
DEBUG No citation, skipping row for case Reinstatement of Tabe
DEBUG No citation, skipping row for case Reinstatement of Gordon
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. O'Neill
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Mayers
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Gallagher
DEBUG No citation, skipping row for case Greenmark Properties v. Parts, Inc.
INFO Case 'Syed v. Lee', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/7a23.pdf' has no matching hash in the DB. Has a citation '488 Md. 537'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/7a23.pdf'
INFO Successfully added opinion 11080272: b'Syed v. Lee'
DEBUG No citation, skipping row for case Bethesda African Cemetery Coal. v. Housing Opp. Comm.
INFO Case 'State v. Thomas', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/15a23.pdf' has no matching hash in the DB. Has a citation '488 Md. 456'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/15a23.pdf'
WARNING , Retrying in 5 seconds...
INFO Successfully added opinion 11080273: b'State v. Thomas'
INFO Saved citation 488 Md. 534 for cluster 10098671: Frederick v. Baltimore City BOE
INFO Saved citation 488 Md. 531 for cluster 10098672: Balt. City BOE v. Mayor & City Cncl. of Balt
INFO Saved citation 488 Md. 454 for cluster 10078915: Attorney Grievance Comm'n v. O'Neill
INFO Saved citation 488 Md. 455 for cluster 10079360: Attorney Grievance Comm'n v. Koh
INFO Saved citation 488 Md. 410 for cluster 10079814: Adventist Healthcare v. Behram
INFO Saved citation 488 Md. 326 for cluster 10046356: In the Matter of McCloy
INFO Saved citation 488 Md. 354 for cluster 10046293: Cook v. State
INFO Saved citation 488 Md. 384 for cluster 10046476: Attorney Grievance Comm'n v. Goldscher
INFO Case 'Turenne v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/20a23.pdf' has no matching hash in the DB. Has a citation '488 Md. 239'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/20a23.pdf'
INFO Successfully added opinion 11080278: b'Turenne v. State'
INFO Case 'Rovin v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/19a23.pdf' has no matching hash in the DB. Has a citation '488 Md. 144'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/19a23.pdf'
INFO Successfully added opinion 11080279: b'Rovin v. State'
INFO Saved citation 488 Md. 45 for cluster 10041667: In the Matter of Hon. Ademiluyi
INFO Case 'Mitchell v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/8a23.pdf' has no matching hash in the DB. Has a citation '488 Md. 1'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/8a23.pdf'
INFO Successfully added opinion 11080280: b'Mitchell v. State'
INFO Case 'State v. Smith', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/30a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 635'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/30a23.pdf'
INFO Successfully added opinion 11080281: b'State v. Smith'
INFO Saved citation 487 Md. 701 for cluster 10039238: Mooney v. State
INFO Saved citation 487 Md. 632 for cluster 10039604: Katz, Abosch, etc., P.A. v. Parkway Neuroscience
INFO Case 'Jarvis v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/22a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 548'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/22a23.pdf'
INFO Successfully added opinion 11080282: b'Jarvis v. State'
INFO Saved citation 487 Md. 487 for cluster 10038343: Bennett v. Gentile
INFO Saved citation 487 Md. 501 for cluster 10027703: Attorney Grievance Comm'n v. Whitted
INFO Saved citation 487 Md. 476 for cluster 10025931: Doctor's Weight Loss Ctrs. v. Blackston
INFO Saved citation 487 Md. 474 for cluster 10020257: Attorney Grievance Comm'n v. Glenn
INFO Saved citation 487 Md. 455 for cluster 10013950: Attorney Grievance Comm'n v. Waldeck
INFO Case 'Attorney Grievance Comm'n v. Hardy', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/4a24ag.pdf' has no matching hash in the DB. Has a citation '487 Md. 456'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/4a24ag.pdf'
INFO Successfully added opinion 11080283: b"Attorney Grievance Comm'n v. Hardy"
INFO Saved citation 487 Md. 454 for cluster 10013355: Application of Lenk to Resign from Bar
INFO Case 'Freeman v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/24a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 420'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/24a23.pdf'
INFO Successfully added opinion 11080284: b'Freeman v. State'
INFO Saved citation 487 Md. 385 for cluster 10010704: Lithko Contracting v. XL Insurance Amer.
INFO Saved citation 487 Md. 354 for cluster 10010705: Town of Bel Air v. Bodt
INFO Saved citation 487 Md. 383 for cluster 9998430: Reinstatement of Tauber
INFO Saved citation 487 Md. 382 for cluster 9998460: Attorney Grievance Comm'n v. Gallagher
INFO Saved citation 487 Md. 384 for cluster 9998514: Attorney Grievance Comm'n v. Davis
DEBUG No citation, skipping row for case Attorney Grievance Comm'n v. Mosby
INFO Case 'Cunningham ex rel Gaines v. Baltimore Cnty.', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/9a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 282'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/9a23.pdf'
INFO Successfully added opinion 11080285: b'Cunningham ex rel Gaines v. Baltimore Cnty.'
INFO Saved citation 487 Md. 260 for cluster 9567459: Attorney Grievance Comm'n v. Lamm
INFO Saved citation 487 Md. 260 for cluster 9834681: Attorney Grievance Comm'n v. Baker
INFO Saved citation 497 Md. 258 for cluster 9509513: Resper v. Dept. of Pub. Saf. & Corr. Servs.
INFO Saved citation 487 Md. 254 for cluster 9509248: Reinstatement of Ibebuchi
INFO Saved citation 487 Md. 256 for cluster 9509249: Attorney Grievance Comm'n v. Teitelbaum
INFO Saved citation 487 Md. 255 for cluster 9509352: Attorney Grievance Comm'n v. Tappan
INFO Saved citation 487 Md. 257 for cluster 9509300: Attorney Grievance Comm'n v. Nelson
INFO Saved citation 487 Md. 214 for cluster 9509127: Walker v. State
INFO Saved citation 487 Md. 216 for cluster 9508885: Mason v. State
INFO Case 'Gonzalez v. State', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/23a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 136'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/23a23.pdf'
INFO Successfully added opinion 11080286: b'Gonzalez v. State'
INFO Saved citation 487 Md. 133 for cluster 9508757: In the Matter of Hon. Ademiluyi
INFO Saved citation 487 Md. 53 for cluster 9495884: In Re: M.P.
INFO Saved citation 487 Md. 52 for cluster 9495373: Reinstatement of Jeffrey to the Bar of Md.
INFO Saved citation 487 Md. 52 for cluster 9495372: Reinstatement of Moody to the Bar of Md.
INFO Case 'Riley v. Venice Beach Citizens Ass'n', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/5a23.pdf' has no matching hash in the DB. Has a citation '487 Md. 1'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/5a23.pdf'
INFO Successfully added opinion 11080287: b"Riley v. Venice Beach Citizens Ass'n"
INFO Saved citation 486 Md. 616 for cluster 9487483: Westminster Management v. Smith
INFO Saved citation 486 Md. 613 for cluster 9487573: Resper v. Dept. of Pub. Saf. & Corr. Servs.
INFO Saved citation 486 Md. 683 for cluster 9487484: Matthews v. State
INFO Saved citation 486 Md. 596 for cluster 9486688: Attorney Grievance Comm'n v. Moir
INFO Case 'Attorney Grievance Comm'n v. Kurtyka', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/44a23ag.pdf' has no matching hash in the DB. Has a citation '486 Md. 594'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/44a23ag.pdf'
INFO Successfully added opinion 11080288: b"Attorney Grievance Comm'n v. Kurtyka"
INFO Saved citation 486 Md. 593 for cluster 9485836: Attorney Grievance Comm'n v. Goldstein
INFO Saved citation 486 Md. 501 for cluster 9481791: Resignation of King Jr.
INFO Saved citation 486 Md. 496 for cluster 9479195: Harvey v. DeMarinis
INFO Saved citation 486 Md. 454 for cluster 9478896: Attorney Grievance Comm'n v. Donnelly
INFO Saved citation 486 Md. 408 for cluster 9486514: Petition of the Off. Of People's Counsel
INFO Saved citation 486 Md. 502 for cluster 9477504: In the Matter of SmartEnergy
INFO Saved citation 486 Md. 407 for cluster 9477505: Attorney Grievance Comm'n v. Anderson
INFO Saved citation 486 Md. 386 for cluster 9477329: Attorney Grievance Comm'n v. Weinberg
INFO Saved citation 486 Md. 385 for cluster 9477330: Attorney Grievance Comm'n v. Johnson
INFO Saved citation 486 Md. 384 for cluster 9477331: Attorney Grievance Comm'n v. Chang
INFO Saved citation 486 Md. 383 for cluster 9477332: Application of Sausser to Resign
INFO Saved citation 486 Md. 382 for cluster 9477333: Application of Patterson to Resign
INFO Case 'Motor Vehicle Admin. v. Usan', opinion 'https://www.mdcourts.gov/data/opinions/coa/2024/6a23.pdf' has no matching hash in the DB. Has a citation '486 Md. 352'. Will try to ingest all objects
INFO Adding new document found at: b'https://www.mdcourts.gov/data/opinions/coa/2024/6a23.pdf'
INFO Successfully added opinion 11080289: b'Motor Vehicle Admin. v. Usan'
INFO Saved citation 486 Md. 338 for cluster 9477335: Reinstatement of Sloane to the Bar of Md.
INFO Saved citation 486 Md. 338 for cluster 9477336: Reinstatement of Kilroy to the Bar of Md.
INFO Saved citation 486 Md. 340 for cluster 9477337: Attorney Grievance Comm'n v. Johnson
INFO Saved citation 486 Md. 641 for cluster 9477338: Attorney Grievance Comm'n v. Buie
INFO Saved citation 486 Md. 339 for cluster 9477339: Attorney Grievance Comm'n v. Bobotek
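
The counts reported above can be cross-checked by tallying the log by message type. This is a rough sketch that relies only on the message formats visible in this log; the label names and the `tally` helper are made up for illustration:

```python
import re
from collections import Counter

# Patterns copied from the message formats in the log above.
PATTERNS = {
    "added_to_existing_cluster": re.compile(r"^INFO Saved citation "),
    "added_with_new_cluster": re.compile(
        r"^INFO Case .* has no matching hash in the DB"
    ),
    "skipped_no_citation": re.compile(r"^DEBUG No citation, skipping row"),
}


def tally(log_lines):
    """Count log lines per outcome; lines matching no pattern are ignored."""
    counts = Counter()
    for line in log_lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break
    return counts


sample = [
    "DEBUG No citation, skipping row for case Willey v. Brown",
    "INFO Case 'Syed v. Lee', opinion "
    "'https://www.mdcourts.gov/data/opinions/coa/2024/7a23.pdf' has no "
    "matching hash in the DB. Has a citation '488 Md. 537'. "
    "Will try to ingest all objects",
    "INFO Successfully added opinion 11080272: b'Syed v. Lee'",
    "INFO Saved citation 488 Md. 534 for cluster 10098671: "
    "Frederick v. Baltimore City BOE",
]
print(tally(sample))
```

Fed the full log, the three counters should add up to the number of items the scraper reported finding, which gives a quick consistency check on the run.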

grossir avatar Jun 19 '25 16:06 grossir

After ga versioning was mostly solved, I ran the backscraper and got 788 new Ga. citations, for years 2022 to 2025, with at most 55 versioning failures (meaning, the opinion containing the citation was ingested and the previous version couldn't be linked)

Details
./manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ga --backscrape-start=2022 --backscrape-end=2026 --verbosity 3

ga-backscrape-citations.txt

courtlistener=> select count(*) from search_citation where reporter  = 'Ga.' and date_created::date = '2025-09-30'::date;
 count 
-------
   788
(1 row)


courtlistener=> select count(*), sum((main_version_id is not null)::int), sum((main_version_id is not null)::int)*2 from search_opinion where cluster_id in (select cluster_id from search_citation where reporter  = 'Ga.' and date_created::date = '2025-09-30'::date);
 count | sum | ?column? 
-------+-----+----------
  1521 | 733 |     1466
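
The "at most 55" figure follows from the two query results. A short worked check, assuming each new citation belongs to a distinct cluster and each cluster holds either one opinion or two linked versions:

```python
new_citations = 788          # Ga. citations created that day
opinions_in_clusters = 1521  # opinions in the clusters those citations point to
linked_versions = 733        # opinions with main_version_id set

# Assumption: a linked version means the cluster holds exactly two opinions
# (the old version and the new one carrying the citation).
clusters_with_two_versions = linked_versions
clusters_without_link = new_citations - clusters_with_two_versions

# Opinions expected under that assumption: two per linked cluster, one otherwise.
expected_opinions = 2 * clusters_with_two_versions + clusters_without_link

print(clusters_without_link)
print(expected_opinions == opinions_in_clusters)
```

The unlinked clusters are an upper bound on versioning failures, since some of them may simply be opinions that were not in the database before.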


grossir avatar Sep 30 '25 01:09 grossir

Ran the command for haw and hawapp. Got 1683 "Haw." citations

./manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.haw --backscrape-start=2018/01/01 --backscrape-end=2025/01/01 --verbosity 3 --backscrape-wait=10

./manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.hawapp --backscrape-start=2018/01/01 --backscrape-end=2025/01/01 --verbosity 3 --backscrape-wait=10

courtlistener=>  select count(*) from search_citation where reporter  = 'Haw.' and date_created::date = '2025-10-29'::date;
 count 
-------
  1683
(1 row)

log-hawapp-citations.txt

logs-haw-citations.txt

grossir avatar Oct 29 '25 20:10 grossir

very nice

flooie avatar Oct 31 '25 14:10 flooie