Dirty `search_citation` data
I have identified 2 types of dirty citation data:
- duplicated citations that match duplicated opinions
- corrupt citations: the same citations for hundreds of different opinions
Duplicated citations that match opinion duplications
For example, the neutral citation 2013 IL 110810 maps to 2 opinion clusters (1, 2). The opinion text seems to be mostly the same, but the two were obtained at different times. One appears to come from the official reporter; the other was probably scraped at (pre)publication time.
I would expect these "dirty" citations to match 2 or 3 opinion clusters at most.
Citations that match hundreds of opinions
I downloaded the latest citation file from the bulk data directory, `citations-2024-03-11.csv.bz2`, which has the columns "id", "volume", "reporter", "page", "type", "cluster_id", matching the DB model.
If we had no duplicated citations, a DISTINCT over "volume", "reporter", "page", "type" would return the same number of rows as the whole table has. This query returns 7 524 430 rows, against the 9 987 094 rows in the whole dataset.
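For anyone who wants to reproduce the DISTINCT check without loading the file into a database, here is a minimal sketch in plain Python. The function names and the tiny in-memory sample are my own assumptions for illustration; only the column names come from the bulk file described above.

```python
import bz2
import csv

# Columns that identify a citation, as listed above.
KEY_FIELDS = ("volume", "reporter", "page", "type")


def read_bulk_csv(path):
    """Yield dict rows from a bz2-compressed CSV file.

    `path` is whatever local path the bulk file was saved to (assumption).
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as fh:
        yield from csv.DictReader(fh)


def count_rows_and_distinct(rows, key_fields=KEY_FIELDS):
    """Return (total_rows, distinct_keys) for an iterable of dict rows."""
    total = 0
    seen = set()
    for row in rows:
        total += 1
        seen.add(tuple(row[f] for f in key_fields))
    return total, len(seen)


# Made-up sample standing in for the real file: two rows share one citation.
sample = [
    {"id": "1", "volume": "1", "reporter": "X", "page": "10", "type": "2", "cluster_id": "100"},
    {"id": "2", "volume": "1", "reporter": "X", "page": "10", "type": "2", "cluster_id": "101"},
    {"id": "3", "volume": "2", "reporter": "X", "page": "20", "type": "2", "cluster_id": "102"},
]

total, distinct = count_rows_and_distinct(sample)
print(total, distinct)  # prints: 3 2
```

On the real file this would be called as `count_rows_and_distinct(read_bulk_csv(path))`. With the figures reported above, the gap between the two counts is 9 987 094 - 7 524 430 = 2 462 664 rows.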
I ran a GROUP BY, COUNT over those fields, and got 231 665 citations that match more than 3 opinion clusters. Some match hundreds.
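The GROUP BY / COUNT step can be sketched the same way with `collections.Counter`. This is only an illustration under my own assumptions: the helper names and sample rows are made up, and it counts rows per citation, which can differ from the number of distinct clusters if one cluster carries the same citation twice.

```python
from collections import Counter

# Grouping columns, in the order used for reporting.
KEY_FIELDS = ("reporter", "volume", "page", "type")


def group_counts(rows, key_fields=KEY_FIELDS):
    """Count rows per (reporter, volume, page, type) key."""
    return Counter(tuple(row[f] for f in key_fields) for row in rows)


def keys_over(counts, threshold=3):
    """Keep only the keys whose row count is strictly above `threshold`."""
    return {key: n for key, n in counts.items() if n > threshold}


def make_rows(reporter, volume, page, type_, n, start_cluster):
    """Build `n` made-up rows that share one citation (illustration only)."""
    return [
        {"reporter": reporter, "volume": volume, "page": page,
         "type": type_, "cluster_id": str(start_cluster + i)}
        for i in range(n)
    ]


sample = (
    make_rows("A", "1", "10", "2", n=4, start_cluster=100)
    + make_rows("B", "1", "20", "2", n=2, start_cluster=200)
    + make_rows("C", "1", "30", "2", n=1, start_cluster=300)
)

counts = group_counts(sample)
heavy = keys_over(counts, threshold=3)   # citations with more than 3 rows
top = counts.most_common(2)              # highest row counts first

print(heavy)
print(top)
```

`most_common(10)` on the real counts would give a top-ten listing ordered by row count.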
The top ten look like this:
reporter | volume | page | type | row count |
---|---|---|---|---|
Ill. Dec. | 307 | 312 | 2 | 400 |
Ohio | 2018 | 365 | 2 | 193 |
Ohio | 2018 | 1600 | 2 | 187 |
U.S.L.W. | 82 | 3182 | 4 | 168 |
U.S.L.W. | 82 | 3183 | 4 | 168 |
U.S.L.W. | 82 | 3186 | 4 | 168 |
U.S.L.W. | 82 | 3187 | 4 | 168 |
U.S.L.W. | 82 | 3188 | 4 | 168 |
U.S.L.W. | 82 | 3329 | 4 | 168 |
U.S.L.W. | 82 | 3406 | 4 | 168 |
Looking at the second one on CourtListener shows that all the results have the same date, and most of the ones I have looked at carry the vLex banner.
Looking at the first one on Courtlistener
About U.S.L.W.: 51 of the top 100 citations by count are from that reporter and from volume 82. An example. I am also seeing a lot of vLex banners for this one. Maybe a data ingestion / merging issue?
Thanks Gianfranco. One thing to note is that citations are not unique. Because they refer to the page that something is published on, it's entirely possible for multiple decisions to be published on the same page. That said, more than 10-20 on a single page makes no sense.
@flooie can you make a plan for digging into these and seeing what we can learn and how to prioritize this against our other projects?
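As a possible starting point for that prioritization, the remarks so far (2 or 3 clusters looks like ordinary duplication, several decisions can share a page, more than 10-20 on a single page makes no sense) could be turned into a rough triage pass. This is a sketch only: the bucket names, the cut-offs of 3 and 20, and the sample rows are assumptions for illustration, not an agreed rule.

```python
from collections import defaultdict

KEY_FIELDS = ("reporter", "volume", "page", "type")


def clusters_per_citation(rows, key_fields=KEY_FIELDS):
    """Map each citation key to the set of distinct cluster_ids it points to."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups[key].add(row["cluster_id"])
    return dict(groups)


def triage(n_clusters, dup_max=3, plausible_max=20):
    """Bucket a citation by how many clusters it matches.

    dup_max and plausible_max are assumed cut-offs, taken loosely from
    the numbers mentioned in this thread.
    """
    if n_clusters <= 1:
        return "unique"
    if n_clusters <= dup_max:
        return "possible duplicate"
    if n_clusters <= plausible_max:
        return "needs review"
    return "implausible"


# Made-up rows: one citation with two clusters, one with a single cluster.
sample = [
    {"reporter": "A", "volume": "1", "page": "10", "type": "2", "cluster_id": "1"},
    {"reporter": "A", "volume": "1", "page": "10", "type": "2", "cluster_id": "2"},
    {"reporter": "B", "volume": "1", "page": "20", "type": "2", "cluster_id": "3"},
]

report = {
    key: triage(len(ids))
    for key, ids in clusters_per_citation(sample).items()
}
print(report)
```

Counting distinct `cluster_id` values rather than rows keeps a citation that is repeated on one cluster from being flagged.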
@flooie Could you share more about the underlying reasons for a single (volume, reporter, page) combination having multiple cluster IDs?
So far I understand that it may be the case when:
1. Multiple opinions are published on the same page. E.g. an opinion of a few sentences is preceded by another opinion on the same page.
2. The reporter makes an n-th publication to the same volume and page. E.g. errata are published to the same page as the opinion.
3. Unintentional duplicates. E.g. the records for (2022 WY 137), 1 and 2, seem equivalent.
Additionally, is there any guidance you can give on how to determine which `cluster_id` to pick for cases (2.) and (3.)?