some plugin assets not working on github codespaces
Description of the issue
Not sure if this is the kit or Codespaces
If a plugin's name has special characters like @ and /, for example @x-govuk/govuk-prototype-components, the assets fail to load on GitHub Codespaces
Steps to reproduce the issue
- Install the kit on GitHub Codespaces and install a plugin with special chars in the name, for example
npm install @x-govuk/govuk-prototype-components
- Go to Templates and look at the templates, any images or js will fail to load
From investigation, req.path normally gives something like
/plugin-assets/%40x-govuk%2Fgovuk-prototype-components/govuk-prototype-kit/init.js
but on Codespaces it gives
/plugin-assets/@x-govuk%252Fgovuk-prototype-components/govuk-prototype-kit/init.js
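The two paths differ in a way that looks like a decode followed by a re-encode somewhere in front of the app: `%40` has been decoded to `@`, while the `%` of `%2F` has been escaped again to `%25`. A minimal Python sketch of that effect, plus one possible normalisation (unquoting until the value stops changing). The helper name `fully_unquote` is made up for illustration and is not part of the kit.

```python
from urllib.parse import quote, unquote

plugin = "@x-govuk/govuk-prototype-components"

# What req.path normally contains: every reserved character escaped once.
encoded_once = quote(plugin, safe="")

# Escaping the "%" of an already-escaped "%2F" produces "%252F".
double_encoded_slash = quote("%2F", safe="")


def fully_unquote(segment: str, max_rounds: int = 5) -> str:
    """Hypothetical helper: unquote until the value no longer changes."""
    for _ in range(max_rounds):
        decoded = unquote(segment)
        if decoded == segment:
            break
        segment = decoded
    return segment


codespaces_segment = "@x-govuk%252Fgovuk-prototype-components"
print(encoded_once)
print(fully_unquote(codespaces_segment))
```

Repeated decoding is only safe if plugin names never legitimately contain a literal `%`, so this is a sketch of the idea rather than a proposed fix.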
Actual vs expected behaviour
Ideally these plugins would work on Codespaces too
Wah wah.
Why don't we have a unique constraint on that hash? Seems like the database could have prevented this...
\d+ search_opinion on the docker compose DB returns
Indexes:
"search_opinion_pkey" PRIMARY KEY, btree (id)
"search_opinion_author_id_69e3caa8" btree (author_id)
"search_opinion_cluster_id_09bd537a" btree (cluster_id)
"search_opinion_date_created_76a4ddf9" btree (date_created)
"search_opinion_date_modified_524fb7ff" btree (date_modified)
"search_opinion_download_url_8428ad91" btree (download_url)
"search_opinion_download_url_8428ad91_like" btree (download_url varchar_pattern_ops)
"search_opinion_extracted_by_ocr_122ced11" btree (extracted_by_ocr)
"search_opinion_local_path_8c124953" btree (local_path)
"search_opinion_local_path_8c124953_like" btree (local_path varchar_pattern_ops)
"search_opinion_sha1_62196033" btree (sha1)
"search_opinion_sha1_62196033_like" btree (sha1 varchar_pattern_ops)
"unique_opinion_ordering_key" UNIQUE CONSTRAINT, btree (cluster_id, ordering_key)
"search_opinion_sha1_62196033" btree (sha1)
So, it is indeed not unique (it would be listed as "UNIQUE CONSTRAINT"). The index must be dropped and then built again; there is no way to add the UNIQUE constraint via ALTER INDEX. But the duplicates must be corrected before re-creating it.
Yeah, makes sense. Let's begin with fixing the dupes and then return here.
When we add the unique constraint, I think we'll want a migration that adds a new index and then removes the old one. That way, if we have look-ups that are coming in during the migration, there will always be an index available.
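To illustrate the ordering (fix the duplicates, add the new unique index, then drop the old one), here is a small sketch that uses an in-memory SQLite table as a stand-in for search_opinion. The table and index names are simplified assumptions; on PostgreSQL the real migration would presumably use CREATE UNIQUE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE opinion (id INTEGER PRIMARY KEY, sha1 TEXT NOT NULL);
    CREATE INDEX opinion_sha1_old ON opinion (sha1);
    INSERT INTO opinion (sha1) VALUES ('aaa'), ('aaa'), ('bbb');
    """
)

# 1. With duplicates present, the unique index cannot be built.
try:
    conn.execute("CREATE UNIQUE INDEX opinion_sha1_uniq ON opinion (sha1)")
    built_with_duplicates = True
except sqlite3.DatabaseError:
    built_with_duplicates = False

# 2. Fix the duplicates (here: keep the newest row per hash).
conn.execute(
    "DELETE FROM opinion "
    "WHERE id NOT IN (SELECT MAX(id) FROM opinion GROUP BY sha1)"
)

# 3. Add the new unique index while the old one still serves look-ups...
conn.execute("CREATE UNIQUE INDEX opinion_sha1_uniq ON opinion (sha1)")
# 4. ...and only then drop the old, non-unique index.
conn.execute("DROP INDEX opinion_sha1_old")

index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list('opinion')"))
remaining_rows = conn.execute("SELECT COUNT(*) FROM opinion").fetchone()[0]
print(built_with_duplicates, index_names, remaining_rows)
```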
Here's an instance of this I just ran into for a California Supreme Court case published August 22, 2024. It's perhaps worth a closer look because it also indicates inconsistent citation parsing.
Query: https://www.courtlistener.com/?q=Rattagan&type=o&order_by=score%20desc&stat_Published=on&court=cal
3 copies of the opinion:
- https://www.courtlistener.com/opinion/10049082/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal
- https://www.courtlistener.com/opinion/10050073/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal
- https://www.courtlistener.com/opinion/10072000/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal
Opinions 10049082 and 10050073 show 16 authorities, but opinion 10072000 shows only 7. Why would that be?
Tagging in @flooie for the citation part.
@anseljh I had a chance to look into this, but I don’t have a definitive explanation for why one citation has more references than the other. It does seem odd at first glance. What I’ve noticed is that many of the missing supras aren’t showing up, which might explain some of the differences.
It doesn’t seem to be a parsing issue, though; it looks more like a search-related problem. Nearly all (though not all) of the missing authorities are actually marked as citation no-link in the source code. This suggests the issue lies with search and citation discovery, not parsing. That said, there are still quite a lot of missing no-link citations.
I’m wondering if we should consider highlighting found citations or making them visually distinct in some way—perhaps by using an underline or another marker?
We can re-use code from opinion versioning to delete same hash duplicates
- without creating stale links, via ClusterRedirection
- merging the relevant metadata fields that differ, like "blocked" status
This queryset has 2278 elements; each may have from a few to a hundred duplicates. I got the queryset counts, and we would delete 45524-2278 = 43246 exact hash duplicates
In [18]: sum([i['number_of_rows'] for i in qs])
Out[18]: 45524
from django.db.models import Count, Q

from cl.scrapers.management.commands.merge_opinion_versions import (
    comparable_dockets,
    merge_metadata,
)
from cl.search.models import SOURCES, ClusterRedirection, Opinion

# Group scraped opinions by hash
# Keep the groups with a single hash, and more than 1 row
# these are same-hash duplicates
qs = (
    Opinion.objects
    .filter(cluster__source=SOURCES.COURT_WEBSITE)
    .exclude(Q(download_url="") | Q(download_url__isnull=True) | Q(sha1=""))
    .values("sha1")
    .annotate(
        number_of_rows=Count("sha1"),
        # compute the number of distinct hashes to prevent colliding with
        # actual duplicates, which are not versions
        number_of_hashes=Count("sha1", distinct=True),
    )
    .order_by()
    .filter(number_of_rows__gte=2, number_of_hashes=1)
)

# for each group, we will keep a single opinion; let's prefer the latest
deleted = 0
for group in qs:
    print("Processing group ", group)
    op_to_keep, *to_delete = (
        Opinion.objects.filter(sha1=group['sha1'])
        .order_by("-date_created")
        .select_related('cluster', 'cluster__docket')
    )

    for op_to_delete in to_delete:
        # check that they have the same docket
        if not comparable_dockets(op_to_keep.cluster.docket, op_to_delete.cluster.docket):
            print("Not the same docket", op_to_keep.cluster.docket.id, op_to_delete.cluster.docket.id)
            continue

        # merge all metadata
        updated_opinion = merge_metadata(op_to_keep, op_to_delete)
        updated_cluster = merge_metadata(op_to_keep.cluster, op_to_delete.cluster)

        # delete opinion
        cluster_to_delete = op_to_delete.cluster
        op_to_delete.delete()

        # delete cluster, leaving a redirection so old links keep working
        ClusterRedirection.create_from_clusters(
            op_to_keep.cluster, cluster_to_delete, ClusterRedirection.DUPLICATE
        )
        cluster_to_delete.delete()
        deleted += 1

        # save the kept objects only if the merge changed them
        if updated_opinion:
            print(f"updating opinion {op_to_keep.id}")
            op_to_keep.save()
        if updated_cluster:
            print(f"updating cluster {op_to_keep.cluster.id}")
            op_to_keep.cluster.save()
I think the code should work well. I also believe that keeping the newest opinion is the best option; with the ClusterRedirection model we can already delete safely without producing 404 error pages.
Wouldn't it be a good option to put the code in a management command, so we can run it again in case new duplicates appear?
Ran
./manage.py delete_duplicates same_hash --verbosity 3
Output
{'same cluster': 47, 'same docket': 20101, 'deleted opinion': 40445, 'deleted cluster': 40398, 'deleted docket': 20344, 'not comparable docket': 2794, 'merging error': 16})
We are down to 67470 same-hash duplicates, spread over 54574 sha1 groups; we deleted most of the 43246 duplicates described above
courtlistener=> select sum(count_by_sha) - count(*) as total_duplicates, count(*) as sha1_groups_count from (select sha1, count(*) count_by_sha from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) a;
 total_duplicates | sha1_groups_count
------------------+-------------------
            67470 |             54574
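For readers less used to the SQL, the same arithmetic can be sketched in Python: every sha1 group with more than one row contributes its row count minus one (the copy we keep) to the duplicate total. The hash values below are toy data, not taken from the database.

```python
from collections import Counter

# Toy sha1 values, one entry per opinion row (illustrative data only).
hashes = ["a", "a", "a", "b", "b", "c", "", ""]

# where sha1 <> '' ... group by sha1
counts = Counter(h for h in hashes if h != "")
# having count(*) > 1
groups = {h: n for h, n in counts.items() if n > 1}

sha1_groups_count = len(groups)
# sum(count_by_sha) - count(*): all rows in duplicated groups minus one keeper each
total_duplicates = sum(groups.values()) - sha1_groups_count
print(total_duplicates, sha1_groups_count)
```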
The affected opinions have the following distribution by creation year:
courtlistener=> select date_part('year', date_created), count(*) from search_opinion where sha1 in (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) group by date_part('year', date_created) order by 2 desc;
date_part | count
-----------+--------
2016 | 115545
2024 | 3618
2013 | 2763
2023 | 87
2025 | 14
2020 | 8
2017 | 4
2010 | 3
2014 | 2
(9 rows)
As for sources, most of these come from non-scraper sources (source <> 'C'), which we ignored by design and should target in a second round
courtlistener=> select source, count(*) from search_opinioncluster a inner join (select cluster_id from search_opinion where sha1 in (select sha1 count_by_sha from search_opinion where sha1 <> '' group by sha1 having count(*) > 1)) b on b.cluster_id = a.id group by source order by 2 desc;
source | count
--------+-------
ZU | 99081
Z | 16261
C | 2961
L | 2052
G | 898
LU | 699
CU | 58
D | 14
ZLU | 11
RU | 4
ZL | 3
R | 1
LRU | 1
(13 rows)
Looking at the Columbia archive duplicates (source = 'Z'), most of them come in pairs, created a few milliseconds apart.
Maybe an error in the import code?
They seem pretty safe to merge, following the same logic as when we merged the scraper-only sources
- abort the merge if there is any metadata difference (say, OpinionCluster.date_filed is different between the 2 clusters)
- merge metadata if one object has it and the other does not (say one cluster has a syllabus, the other does not)
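The two rules above can be sketched on plain dictionaries. `merge_fields` is a hypothetical stand-in written for this comment, not the project's `merge_metadata`, and the field values are invented.

```python
def merge_fields(keep: dict, drop: dict) -> bool:
    """Hypothetical sketch of the merge rules.

    Raises ValueError (aborting the merge) if both objects have a value
    for the same field and the values differ. Otherwise copies values
    that only `drop` has into `keep` and returns True if anything changed.
    """
    conflicts = [
        field
        for field, value in drop.items()
        if value and keep.get(field) and keep[field] != value
    ]
    if conflicts:
        raise ValueError(f"metadata differs on: {', '.join(conflicts)}")

    to_fill = {f: v for f, v in drop.items() if v and not keep.get(f)}
    keep.update(to_fill)
    return bool(to_fill)


keep = {"date_filed": "2005-02-10", "syllabus": ""}
drop = {"date_filed": "2005-02-10", "syllabus": "Example syllabus"}
updated = merge_fields(keep, drop)

try:
    merge_fields({"date_filed": "2005-02-10"}, {"date_filed": "2005-02-11"})
    aborted = False
except ValueError:
    aborted = True
print(updated, keep["syllabus"], aborted)
```

Checking for conflicts before filling anything means an aborted merge leaves the kept object untouched.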
So, I think we should only relax the source condition in the existing code
https://github.com/freelawproject/courtlistener/blob/e51752a5a4f8b45ba21a831838eb72a0a2db489c/cl/scrapers/management/commands/delete_duplicates.py#L118
and probably turn it into a command argument, so we can choose the combination of sources we want to merge
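A rough sketch of what such an argument could look like, using plain argparse (which Django management commands build on). The flag name `--cluster-sources` follows the invocations shown later in this thread; the default of `["C"]` and the `mode` positional are assumptions for illustration and may not match the actual command.

```python
import argparse

parser = argparse.ArgumentParser(prog="delete_duplicates")
# Assumed positional selecting the kind of duplicate to process.
parser.add_argument("mode", choices=["same_hash"])
parser.add_argument(
    "--cluster-sources",
    nargs="+",
    default=["C"],  # assumption: scraper-only, as in the first round
    help="OpinionCluster.source values whose same-hash duplicates may be merged",
)

args = parser.parse_args(["same_hash", "--cluster-sources", "LU", "L"])
default_args = parser.parse_args(["same_hash"])
print(args.cluster_sources, default_args.cluster_sources)
```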
select a.date_created, case_name, a.id opinion_id, source, sha1, download_url from search_opinion a inner join search_opinioncluster b ON a.cluster_id = b.id where sha1 in (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) and source = 'Z' limit 200;
date_created | case_name | opinion_id | source | sha1 | download_url
-------------------------------+------------------------------------------------------------------------------+------------+--------+------------------------------------------+--------------
2016-07-06 05:56:55.498859+00 | Jones v. New Hanover Cty. Schools | 3638000 | Z | 00037b7d3a16851737ddea0669f39ddbfa4c0309 |
2016-07-06 05:56:55.491142+00 | Jones v. New Hanover Cty. Schools | 3637999 | Z | 00037b7d3a16851737ddea0669f39ddbfa4c0309 |
2016-07-06 06:56:12.70296+00 | State v. Abdul-Mumin, Unpublished Decision (2-10-2005) | 3727189 | Z | 000f490e57e9be726e46f64584a7049010bc24cf |
2016-07-06 06:56:12.711009+00 | State v. Abdul-Mumin, Unpublished Decision (2-10-2005) | 3727190 | Z | 000f490e57e9be726e46f64584a7049010bc24cf |
2016-07-06 06:41:26.883502+00 | Applegate v. Applegatae, 1724 (12-28-2007) | 3703907 | Z | 00127e1d4f79d7cf84bd306906dfc858fa95d135 |
2016-07-06 06:41:26.890619+00 | Applegate v. Applegatae, 1724 (12-28-2007) | 3703908 | Z | 00127e1d4f79d7cf84bd306906dfc858fa95d135 |
2016-07-06 06:42:38.898062+00 | State v. Bundy, Unpublished Decision (6-24-2005) | 3705897 | Z | 001a13523f6e69d2d04f748abbc52b57dcf93559 |
2016-07-06 06:42:38.907802+00 | State v. Bundy, Unpublished Decision (6-24-2005) | 3705898 | Z | 001a13523f6e69d2d04f748abbc52b57dcf93559 |
2016-07-05 23:35:10.141819+00 | Craft v. . Merrill and Another | 3584848 | Z | 00290522a7ae848b5c0604fee732bf3a83d65792 |
2016-07-05 23:35:10.148935+00 | Craft v. . Merrill and Another | 3584849 | Z | 00290522a7ae848b5c0604fee732bf3a83d65792 |
2016-07-06 06:47:05.136316+00 | State v. Rector, Unpublished Decision (10-1-2003) | 3712984 | Z | 004230367c1875a52ab9b45d973446ceff465530 |
2016-07-06 06:47:05.142522+00 | State v. Rector, Unpublished Decision (10-1-2003) | 3712985 | Z | 004230367c1875a52ab9b45d973446ceff465530 |
2016-07-06 06:48:53.839032+00 | Kennedy v. Conrad, Unpublished Decision (3-14-2002) | 3715912 | Z | 0045347a304869b73f81e4e5b688249f7bd1f223 |
2016-07-06 06:48:53.832939+00 | Kennedy v. Conrad, Unpublished Decision (3-14-2002) | 3715911 | Z | 0045347a304869b73f81e4e5b688249f7bd1f223 |
2016-07-05 20:45:51.160137+00 | Morrissey v. Police Jury of Vermilion Parish | 3474596 | Z | 00475822dd8cec5d88b0879135eb945bb4c90278 |
2016-07-05 20:45:51.166393+00 | Morrissey v. Police Jury of Vermilion Parish | 3474597 | Z | 00475822dd8cec5d88b0879135eb945bb4c90278 |
2016-07-06 06:28:20.368436+00 | State v. Thomas, Unpublished Decision (7-13-2000) | 3682894 | Z | 0050ab36f8dc0fc82b85eedd7225d64d11150f74 |
2016-07-06 06:28:20.361835+00 | State v. Thomas, Unpublished Decision (7-13-2000) | 3682893 | Z | 0050ab36f8dc0fc82b85eedd7225d64d11150f74 |
2016-07-06 09:56:54.313842+00 | Childers v. State | 3931690 | Z | 005776b0fd0f0da403daffca925f53d3da1a1dab |
2016-07-06 09:56:54.325421+00 | Childers v. State | 3931691 | Z | 005776b0fd0f0da403daffca925f53d3da1a1dab |
2016-07-06 06:41:30.488012+00 | In Re Carson, 2007ca00070 (10-22-2007) | 3704007 | Z | 005e1a6fc0e0985c860322b259878d7dde8538ae |
2016-07-06 06:41:30.479926+00 | In Re Carson, 2007ca00070 (10-22-2007) | 3704006 | Z | 005e1a6fc0e0985c860322b259878d7dde8538ae |
2016-07-06 00:06:01.575383+00 | The People v. . Clements | 3625269 | Z | 006d6ffc4e712df662e9cb509cff6732c6c1150c |
2016-07-06 00:06:01.581533+00 | The People v. . Clements | 3625270 | Z | 006d6ffc4e712df662e9cb509cff6732c6c1150c |
2016-07-06 06:31:57.908195+00 | Morrison v. Petro Evaluation, Unpublished Decision (10-21-2005) | 3688579 | Z | 006fc6ddff3c50b1fcdd503e1bee4a7cb6ff0757 |
2016-07-06 06:31:57.916008+00 | Morrison v. Petro Evaluation, Unpublished Decision (10-21-2005) | 3688580 | Z | 006fc6ddff3c50b1fcdd503e1bee4a7cb6ff0757 |
2016-07-06 06:38:26.489139+00 | State v. Ball, 07ap-818 (6-3-2008) | 3699055 | Z | 0073a751c30952e69cd70993672017a59d17a86e |
2016-07-06 06:38:26.495214+00 | State v. Ball, 07ap-818 (6-3-2008) | 3699056 | Z | 0073a751c30952e69cd70993672017a59d17a86e |
2016-07-06 07:11:41.084508+00 | Wilson v. Brush Wellman, Inc., Unpublished Decision (10-17-2002) | 3751658 | Z | 0083edb9c4424e2c9481c88cc6c11e51532991c8 |
2016-07-06 07:11:41.091379+00 | Wilson v. Brush Wellman, Inc., Unpublished Decision (10-17-2002) | 3751659 | Z | 0083edb9c4424e2c9481c88cc6c11e51532991c8 |
2016-07-06 06:57:46.075215+00 | State v. Smith, Unpublished Decision (10-20-1997) | 3729684 | Z | 008da6d094d654a64206e75df3f6dd016a2670ad |
2016-07-06 06:57:46.068159+00 | State v. Smith, Unpublished Decision (10-20-1997) | 3729683 | Z | 008da6d094d654a64206e75df3f6dd016a2670ad |
2016-07-06 05:58:03.379422+00 | Walker v. N.C. D.O.T. | 3640032 | Z | 00ab609e3beb4650a2a95f9c17dded7f87698d1a |
2016-07-06 05:58:03.386411+00 | Walker v. N.C. D.O.T. | 3640033 | Z | 00ab609e3beb4650a2a95f9c17dded7f87698d1a |
2016-07-06 07:27:01.533687+00 | State v. Rakaf, 2008-P-0057 (12-31-2008) | 3776587 | Z | 00ba828baebef578a56bf18047c4ce432ec17968 |
2016-07-06 07:27:01.526016+00 | State v. Rakaf, 2008-P-0057 (12-31-2008) | 3776586 | Z | 00ba828baebef578a56bf18047c4ce432ec17968 |
2016-07-06 05:58:30.740587+00 | Johnson v. City of Winston-Salem | 3640842 | Z | 00c25a6702cd8819eb0781fa252b817d735d3015 |
2016-07-06 05:58:30.748391+00 | Johnson v. City of Winston-Salem | 3640843 | Z | 00c25a6702cd8819eb0781fa252b817d735d3015 |
2016-07-06 07:02:40.528436+00 | State v. Blanchard, 90935 (3-26-2009) | 3737489 | Z | 00d11d22924e99370aa7afd6000610b45e6fa1ec |
2016-07-06 07:02:40.535805+00 | State v. Blanchard, 90935 (3-26-2009) | 3737490 | Z | 00d11d22924e99370aa7afd6000610b45e6fa1ec |
2016-07-06 07:21:53.563336+00 | State Ex Rel. Petro v. Marshall, Unpublished Decision (10-10-2006) | 3768302 | Z | 00d29d3f089820c4024ad40fdd47aa4f2f22fa4c |
2016-07-06 07:21:53.572104+00 | State Ex Rel. Petro v. Marshall, Unpublished Decision (10-10-2006) | 3768303 | Z | 00d29d3f089820c4024ad40fdd47aa4f2f22fa4c |
2016-07-05 22:05:36.388811+00 | In Re Estate of Taylor | 3498141 | Z | 00d2a6cced199d443f0cbc18f986f1377a77c233 |
2016-07-05 22:05:36.395971+00 | In Re Estate of Taylor | 3498142 | Z | 00d2a6cced199d443f0cbc18f986f1377a77c233 |
2016-07-06 07:10:26.032799+00 | Copelco Capital v. St. Mark's Church, Unpublished Decision (2-1-2001) | 3749824 | Z | 00d6d8830307918961ba155a7372b82eb4f1c910 |
2016-07-06 07:10:26.027705+00 | Copelco Capital v. St. Mark's Church, Unpublished Decision (2-1-2001) | 3749823 | Z | 00d6d8830307918961ba155a7372b82eb4f1c910 |
2016-07-05 20:43:11.132823+00 | Oubre v. Mutual Life Ins. Co. of New York | 3473002 | Z | 00de1b8d5a763dde41dbf8b93baefda4f0f5de7a |
2016-07-05 20:43:11.125891+00 | Oubre v. Mutual Life Ins. Co. of New York | 3473001 | Z | 00de1b8d5a763dde41dbf8b93baefda4f0f5de7a |
2016-07-06 06:41:56.33185+00 | State v. Tolar, Unpublished Decision (10-31-2003) | 3704710 | Z | 00df9819a45162c427cc4d270c95cdac4d92cf82 |
Checking the Lawbox duplicates (source = 'L')
- not timestamp clustered
- probably caused by running the import twice without checking for duplicates
courtlistener=> select a.date_created, case_name, a.id opinion_id, source, sha1, download_url from search_opinion a inner join search_opinioncluster b ON a.cluster_id = b.id where sha1 in (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) and source = 'L' limit 200;
date_created | case_name | opinion_id | source | sha1 | download_url
-------------------------------+---------------------------------------------------------+------------+--------+------------------------------------------+--------------
2013-10-30 09:14:16.05283+00 | Olivar v. Nooth | 2358485 | L | 0027244bdf84359cf019951d3d7d99a25edf0c5f |
2013-11-01 21:00:35.650582+00 | Olivar v. Nooth | 2631281 | L | 0027244bdf84359cf019951d3d7d99a25edf0c5f |
2013-11-01 20:45:33.819633+00 | Smith v. Executive Custom Homes, Inc. | 2626035 | L | 002be05587129cf10808c5e4b31a26a5d3a03aeb |
2013-10-30 09:16:18.825383+00 | Mills v. Western Washington University | 2369882 | L | 004fef16166a853d4be13415d7606c0f87b3a3ca |
2013-11-01 20:52:20.262574+00 | Mills v. WESTERN WASHINGTON UNIVERSITY | 2628541 | L | 004fef16166a853d4be13415d7606c0f87b3a3ca |
2013-11-01 20:38:58.289027+00 | BE & K. CONST. v. Abbott | 2621238 | L | 005a88bdd566cb38aeef2dc6c3d59d021dc6c9c1 |
2013-11-01 21:00:58.154575+00 | State v. ALVERTO | 2631472 | L | 0083d993f083c6e6816ed9c6b048406a86b4c9f3 |
2013-10-30 09:16:12.859072+00 | State v. ALVERTO | 2369322 | L | 0083d993f083c6e6816ed9c6b048406a86b4c9f3 |
2013-11-01 20:44:37.689611+00 | Citizens for Resp. Growth v. Rci Dev't Ptr. | 2625445 | L | 00fd98ea6d0602a33831a17b499d7746a1578486 |
2013-11-01 20:53:10.018283+00 | Ahlschlager v. LAWTON SCHOOL DIST. | 2629040 | L | 017a4012e489a14eaf559f3b2c371a598c239da7 |
2013-11-01 20:44:52.448814+00 | Hicks v. Londre | 2625502 | L | 018f120de6e1020bd1be54656634a5f21389ca68 |
2013-11-01 20:43:37.230223+00 | STATE EX REL. OKLAHOMA BAR ASS'N v. Edwards | 2624756 | L | 01c886e38f82351aac9db0ffedd0b64484663fd8 |
2013-11-01 20:53:19.004749+00 | State v. Vargas-Torres | 2629065 | L | 01cbbcaee92c2ed7c1981031bda9f86f1b597922 |
2013-10-30 09:17:06.653186+00 | Sather v. City of Spokane | 2374802 | L | 02462bb2df50792f242fc0ad86779557092707d9 |
2013-11-01 21:01:05.517301+00 | Sather v. City of Spokane | 2631521 | L | 02462bb2df50792f242fc0ad86779557092707d9 |
2013-11-01 20:44:17.257844+00 | Andrus v. Andrus | 2625163 | L | 0252722b1fb32178ec71584d89826a4ac29eec4e |
2013-10-30 09:18:04.081236+00 | St. Joseph Gen. Hosp. v. Dept. of Revenue | 2380520 | L | 027956fa029df58fcecf992258e8f10c9825dc01 |
2013-11-01 20:53:33.723249+00 | St. Joseph Gen. Hosp. v. Dept. of Revenue | 2629148 | L | 027956fa029df58fcecf992258e8f10c9825dc01 |
2013-11-01 21:31:35.696104+00 | In Re Doe | 2640443 | L | 0281fcef17c96341df58ca617517ed495c2c8ade |
2013-11-01 20:52:24.588853+00 | State v. Hernandez-Lopez | 2628565 | L | 0290887f633b8c9c5ce437c18b7f7f68ac0ba2a1 |
2013-10-30 09:16:35.107471+00 | State v. Hernandez-Lopez | 2371613 | L | 0290887f633b8c9c5ce437c18b7f7f68ac0ba2a1 |
2013-11-01 20:44:48.488829+00 | Jc v. Dungarvin Colorado, LLC | 2625467 | L | 02a3ca1851cce5c7bfb71f1a150bd6f62626d88b |
2013-11-01 20:43:28.675387+00 | State v. Hager | 2624708 | L | 0308170b66bd55c6428b24d47285b662c4db71fe |
2013-10-30 09:15:44.860015+00 | State v. Hager | 2366473 | L | 0308170b66bd55c6428b24d47285b662c4db71fe |
2013-11-01 21:17:58.644906+00 | Horton v. Mitchell | 2633377 | L | 03141fc6fa7fbace237b631d23c52152016e1d97 |
2013-11-01 20:53:26.900092+00 | Rivera-Longoria v. Slayton | 2629095 | L | 031ee323ddd9586044da690331a40fecc0fa5845 |
2013-10-30 10:56:44.052404+00 | In Re Personal Restraint Petition of Silas | 2589978 | L | 037c8b731717ffafc2dfd758abb0a4d2b9f8a183 |
2013-11-01 20:42:42.081011+00 | In Re Personal Restraint Petition of Silas | 2624113 | L | 037c8b731717ffafc2dfd758abb0a4d2b9f8a183 |
2013-10-30 11:00:56.891017+00 | Biggers v. City of Bainbridge Island | 2595071 | L | 03840ddcba7bac31346068afbbd0a6e1f5433e7f |
2013-11-01 20:44:16.115047+00 | Biggers v. City of Bainbridge Island | 2625138 | L | 03840ddcba7bac31346068afbbd0a6e1f5433e7f |
2013-10-30 09:18:29.525797+00 | State v. Vars | 2382308 | L | 03c98e7d91facfe15aa7a050037f3af78d453831 |
2013-11-01 21:01:52.675138+00 | State v. Vars | 2631946 | L | 03c98e7d91facfe15aa7a050037f3af78d453831 |
2013-11-01 20:42:43.061601+00 | State v. Anderson | 2624122 | L | 041d6e8588a3dd746086aee6ac00d9503bfbed65 |
2013-11-01 20:52:08.167247+00 | West v. Reed | 2628391 | L | 043989e79ac383c7cddedcf19fc4fd8873662e13 |
2013-10-30 09:14:05.407474+00 | West v. Reed | 2357687 | L | 043989e79ac383c7cddedcf19fc4fd8873662e13 |
2013-11-01 21:02:14.450992+00 | People v. Lynch | 2632100 | L | 0447f692fa504488982ba8572057654628564015 |
2013-11-01 20:44:29.958702+00 | Normandeau v. HANSON EQUIPMENT, INC. | 2625340 | L | 045672790a8a933dd17d4d9b8e1e607ac336b718 |
2013-11-01 20:43:42.660102+00 | State v. Taylor | 2624805 | L | 04834554be2109e0a328c3c85670749f0422c911 |
2013-10-30 09:17:44.551268+00 | State v. Taylor | 2378761 | L | 04834554be2109e0a328c3c85670749f0422c911 |
2013-11-01 21:01:02.355339+00 | Smith v. Holbrook | 2631506 | L | 049f82e1c9b3a7de35a9206f716dbbe0ce6b5e0d |
2013-10-30 09:16:45.347923+00 | Smith v. Holbrook | 2372628 | L | 049f82e1c9b3a7de35a9206f716dbbe0ce6b5e0d |
2013-10-30 08:35:57.886318+00 | Jordan v. BELLEQUE | 2205065 | L | 04e1515f3074d748d1297af9f84d991bda78e92a |
2013-11-01 21:01:51.080407+00 | Jordan v. BELLEQUE | 2631927 | L | 04e1515f3074d748d1297af9f84d991bda78e92a |
2013-11-01 20:53:13.907113+00 | State v. Sievers | 2629058 | L | 054d8e3ad3dae94cafaa79d564db73996d1905fe |
2013-10-30 09:15:31.002335+00 | State v. Sievers | 2365199 | L | 054d8e3ad3dae94cafaa79d564db73996d1905fe |
2013-10-30 09:16:07.807751+00 | State v. Spradlin | 2368795 | L | 0550b671f479fa8eaa595563c57fa66f2c76b57f |
2013-11-01 20:43:32.352155+00 | State v. Spradlin | 2624734 | L | 0550b671f479fa8eaa595563c57fa66f2c76b57f |
2013-10-30 09:15:16.087202+00 | Charlton v. TOYS" R" US-DELAWARE, INC. | 2363612 | L | 057f15792759a91a211174f3e54416747fb66586 |
2013-11-01 20:52:13.705656+00 | Charlton v. TOYS" R" US-DELAWARE, INC. | 2628453 | L | 057f15792759a91a211174f3e54416747fb66586 |
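To make the difference between the two sources concrete, this sketch computes the creation-time gap for one pair from each table. The timestamps are copied from the rows above, with the `+00` offset written as `+00:00` so that `datetime.fromisoformat` accepts it on older Python versions.

```python
from datetime import datetime, timedelta


def gap(first: str, second: str) -> timedelta:
    """Absolute time difference between two ISO 8601 timestamps."""
    return abs(datetime.fromisoformat(second) - datetime.fromisoformat(first))


# Columbia (source Z): Jones v. New Hanover Cty. Schools, opinions 3637999 and 3638000
columbia_gap = gap(
    "2016-07-06 05:56:55.491142+00:00", "2016-07-06 05:56:55.498859+00:00"
)
# Lawbox (source L): Sather v. City of Spokane, opinions 2374802 and 2631521
lawbox_gap = gap(
    "2013-10-30 09:17:06.653186+00:00", "2013-11-01 21:01:05.517301+00:00"
)
print(columbia_gap, lawbox_gap)
```

The Columbia pair is under 8 milliseconds apart, while the Lawbox pair is more than two days apart, which fits the "import run twice" guess.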
Only 776 hashes have more than 1 source:
select
    sha1, count(distinct(source))
from search_opinioncluster a
inner join (
    select cluster_id, sha1
    from search_opinion
    where sha1 in (
        select sha1
        from search_opinion
        where sha1 <> ''
        group by sha1
        having count(*) > 1
    )
) b on b.cluster_id = a.id
group by sha1
having count(distinct(source)) > 1
order by 2 desc
;
Source L
Ran the command for source L (lawbox) and got an error due to a bug in the merging code when trying to migrate related objects from one cluster to the other
./manage.py delete_duplicates same_hash --cluster-sources L --verbosity 3
INFO Groups to process 670 for sources ['L']
...
INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7efd345f2b60>, {'deleted opinion': 39, 'deleted cluster': 39, 'deleted docket': 39})
delete_duplicates_source_L.txt
Update
After fixing the merging bug
INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7fa453362d40>, {'deleted opinion': 623, 'deleted cluster': 623, 'deleted docket': 623, 'merging error': 8})
delete-duplicates-source-l.txt
Source LU
./manage.py delete_duplicates same_hash --cluster-sources LU L --verbosity 3
INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7ff6329a2d40>, {'deleted opinion': 569, 'deleted cluster': 569, 'deleted docket': 569, 'not comparable docket': 121, 'merging error': 17})
delete-duplicates-source-lu-l.txt
Source Z
Ran the command for source Z and got a bunch of unexpected differences preventing the merge
A lot of Sentry events (filter to August 20, 2025 around 9:19 PM UTC)
A bunch of these seem to be an error in the hash assignment. See this example with 2 sub opinions that have the exact same Opinion.sha1 but actually have different content. They shouldn't have the same hash at all
- https://www.courtlistener.com/api/rest/v4/opinions/3695705/
- https://www.courtlistener.com/api/rest/v4/opinions/3695706/
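One way to flag rows like these would be to recompute the hash from the content and compare it with the stored value. The sketch below assumes `Opinion.sha1` is meant to be the SHA-1 of the document bytes, which this thread does not confirm; the function names and sample texts are hypothetical.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Hex SHA-1 digest of the given bytes."""
    return hashlib.sha1(content).hexdigest()


def has_consistent_hash(stored_sha1: str, content: bytes) -> bool:
    """Hypothetical check: does the stored hash match the content?"""
    return stored_sha1 == sha1_of(content)


lead = b"Lead opinion text"
dissent = b"Dissenting opinion text"
# Imagine both sub-opinion rows were saved with the hash of the first one.
stored = sha1_of(lead)
lead_ok = has_consistent_hash(stored, lead)
dissent_ok = has_consistent_hash(stored, dissent)
print(lead_ok, dissent_ok)
```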
Source G
Ran the command for source G; most were not merged due to having different docket numbers. This seems to be happening due to the same opinion existing one time for each of the consolidated dockets. This needs further analysis
See an example opinion
The clusters with the same hash opinion above
- Reece, No. 1:19-CV-219;
- Bamrick, No.1:19-CV-225;
- Driscoll, No. 1:19-CV-231;
- Gates, No. 1:19-CV-221;
- Slowey, No. 1:19-CV-216;
- Webber, No. 1:19-CV-220;
- Wyman, No. 1:19-CV-215.
delete_duplicates_source_g.txt
./manage.py delete_duplicates same_hash --cluster-sources G --verbosity 3
INFO Groups to process 344 for sources ['G']
INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7fa687eceb60>, {'not comparable docket': 505, 'same docket': 47, 'deleted opinion': 47, 'deleted cluster': 47, 'merging error': 2})
Not sure why these have not been deleted yet
select
    court_id, download_url, count, nsha1
from (
    select download_url, max(cluster_id) as cluster_id, count(*) as count, count(distinct(sha1)) as nsha1
    from search_opinion
    where download_url <> ''
    group by download_url
    having count(*) > 50
) a
inner join search_opinioncluster soc on cluster_id = soc.id
inner join search_docket sd on sd.id = docket_id
order by 2 desc;
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55043&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion | 143 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55019&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion | 142 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55013&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion | 142 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55009&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion | 142 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42499&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 58 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42496&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 58 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42467&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 57 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42466&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 57 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42464&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 57 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42462&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion | 57 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24829&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24828&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24827&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24823&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24822&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24819&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24818&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24814&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texcrimapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24813&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion | 93 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=14438&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa12%5cOpinion | 107 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=14435&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa12%5cOpinion | 107 | 1
texapp | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=12447&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa10%5cOpinion | 130 | 1
ind | https://public.courts.in.gov/Decisions/api/Document/Opinion?Id=CjY8lm-9eeN6IGLO9oee_CIsFS9mLqs4IDMF1VXNtvBXzkLz-mZimOUf5k2w7esT0 | 170 | 1
ind | https://public.courts.in.gov/Decisions/api/Document/Opinion?Id=46HRQ-DEpRSJkJne5gp7Splkf_ezdYB0et7R5x84rFkH3Psf0EG3oq5tnZngeO6v0 | 171 | 1
Ran ./manage.py delete_duplicates same_hash --verbosity 3 after implementing this improvement. After this run, we are left with
{'not comparable docket': 8403, 'same docket': 1254, 'deleted opinion': 2598, 'deleted cluster': 2598, 'merging error': 16, 'deleted docket': 1344}
delete-same-hash-duplicates-log.txt
- for SCRAPER "C" source: hard edge cases that won't merge because they have different metadata. Some of these may be corrections by the courts (16 merging errors); others may be bugs in docket assignment on our part ("not comparable docket", but since comparisons are done in combinations, the real number is lower)
- for other sources / mixed sources:
  - for HARVARD, we are still computing the hashes, but a lot of them will have repeated hashes due to their content being just short texts
  - for HARVARD mixed sources, we still need more work
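As a rough illustration of why a "not comparable docket" count can exceed the number of opinions involved, assume every pair inside a same-hash group is compared (an assumption about how the command counts; the thread only says comparisons are done in combinations). Using the seven consolidated docket numbers from the Source G example:

```python
from itertools import combinations

# Docket numbers listed in the Source G example earlier in this thread.
dockets = [
    "1:19-CV-219", "1:19-CV-225", "1:19-CV-231", "1:19-CV-221",
    "1:19-CV-216", "1:19-CV-220", "1:19-CV-215",
]

# Every unordered pair of opinions in the group is one comparison.
pairs = list(combinations(dockets, 2))
not_comparable = sum(1 for a, b in pairs if a != b)
print(len(dockets), not_comparable)
```

Under that assumption, seven opinions with seven distinct dockets produce 21 "not comparable" events.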