govuk-prototype-kit

some plugin assets not working on github codespaces

Open joelanman opened this issue 1 year ago • 1 comments

Description of the issue

Not sure if this is the kit or Codespaces

If a plugin's name has special characters like @ and /, for example @x-govuk/govuk-prototype-components, the assets fail to load on GitHub Codespaces

Steps to reproduce the issue

  1. Install the kit on GitHub Codespaces and install a plugin with special chars in the name, for example
     npm install @x-govuk/govuk-prototype-components
  2. Go to Templates and look at the templates, any images or js will fail to load

From investigation, req.path normally gives something like

/plugin-assets/%40x-govuk%2Fgovuk-prototype-components/govuk-prototype-kit/init.js

but on Codespaces it gives

/plugin-assets/@x-govuk%252Fgovuk-prototype-components/govuk-prototype-kit/init.js
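
The difference between the two paths looks like a second round of percent-encoding: the `%` of `%2F` has itself been encoded as `%25`, while `%40` has been decoded back to `@`. Below is a minimal Python sketch of that effect. Which layer re-encodes the path is not established here, and `decode_fully` is a hypothetical helper, not part of the kit.

```python
from urllib.parse import quote, unquote

PLUGIN = "@x-govuk/govuk-prototype-components"

# Encoding the plugin name once gives the segment seen in the normal req.path.
encoded_once = quote(PLUGIN, safe="")

# The segment reported on Codespaces: "@" is decoded, "%2F" became "%252F".
codespaces_segment = "@x-govuk%252Fgovuk-prototype-components"


def decode_fully(segment: str, max_rounds: int = 5) -> str:
    """Percent-decode repeatedly until the value stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(segment)
        if decoded == segment:
            break
        segment = decoded
    return segment


print(encoded_once)
print(decode_fully(encoded_once) == decode_fully(codespaces_segment))
```

Decoding until the value is stable maps both forms to the same plugin name, but it would misread a name that legitimately contains a literal `%`, so it is only a diagnostic aid, not a proposed fix.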

Actual vs expected behaviour

Ideally these plugins would work on Codespaces too

joelanman avatar Aug 20 '24 15:08 joelanman

Wah wah.

Why don't we have a unique constraint on that hash? Seems like the database could have prevented this...

mlissner avatar Aug 29 '24 16:08 mlissner

\d+ search_opinion on the docker compose DB returns

Indexes:
    "search_opinion_pkey" PRIMARY KEY, btree (id)
    "search_opinion_author_id_69e3caa8" btree (author_id)
    "search_opinion_cluster_id_09bd537a" btree (cluster_id)
    "search_opinion_date_created_76a4ddf9" btree (date_created)
    "search_opinion_date_modified_524fb7ff" btree (date_modified)
    "search_opinion_download_url_8428ad91" btree (download_url)
    "search_opinion_download_url_8428ad91_like" btree (download_url varchar_pattern_ops)
    "search_opinion_extracted_by_ocr_122ced11" btree (extracted_by_ocr)
    "search_opinion_local_path_8c124953" btree (local_path)
    "search_opinion_local_path_8c124953_like" btree (local_path varchar_pattern_ops)
    "search_opinion_sha1_62196033" btree (sha1)
    "search_opinion_sha1_62196033_like" btree (sha1 varchar_pattern_ops)
    "unique_opinion_ordering_key" UNIQUE CONSTRAINT, btree (cluster_id, ordering_key)

"search_opinion_sha1_62196033" btree (sha1)

So it is indeed not unique (a unique index would show "UNIQUE CONSTRAINT"). The index must be dropped and then built again, since there is no way to add the UNIQUE constraint via ALTER INDEX. The duplicates must be corrected before re-creating it.

grossir avatar Sep 06 '24 00:09 grossir

Yeah, makes sense. Let's begin with fixing the dupes and then return here.

When we add the unique constraint, I think we'll want a migration that adds a new index and then removes the old one. That way, if we have look-ups that are coming in during the migration, there will always be an index available.
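
The ordering discussed above (fix the duplicates, add the new unique index, then drop the old one) can be tried out on a toy table. This sketch uses SQLite from the Python standard library as a stand-in for PostgreSQL, so it only illustrates the sequence. The index names and the "keep the highest id" rule are assumptions for the example; the real change would be a Django migration, probably building the index concurrently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE search_opinion (id INTEGER PRIMARY KEY, sha1 TEXT NOT NULL);
    CREATE INDEX search_opinion_sha1_old ON search_opinion (sha1);
    INSERT INTO search_opinion (sha1) VALUES ('aaa'), ('aaa'), ('bbb');
    """
)


def try_unique_index(connection: sqlite3.Connection) -> bool:
    """Return True if the unique index could be created, False otherwise."""
    try:
        connection.execute(
            "CREATE UNIQUE INDEX search_opinion_sha1_uniq ON search_opinion (sha1)"
        )
        return True
    except sqlite3.DatabaseError:
        return False


# 1. With duplicates present, the unique index cannot be built.
first_attempt = try_unique_index(conn)

# 2. Remove duplicates (toy rule: keep the highest id per hash).
conn.execute(
    "DELETE FROM search_opinion "
    "WHERE id NOT IN (SELECT MAX(id) FROM search_opinion GROUP BY sha1)"
)

# 3. Add the new unique index first, then drop the old one,
#    so look-ups always have an index available.
second_attempt = try_unique_index(conn)
conn.execute("DROP INDEX search_opinion_sha1_old")

index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('search_opinion')")
)
print(first_attempt, second_attempt, index_names)
```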

mlissner avatar Sep 06 '24 18:09 mlissner

Here's an instance of this I just ran into for a California Supreme Court case published August 22, 2024. It's perhaps worth a closer look because it also indicates inconsistent citation parsing.

Query: https://www.courtlistener.com/?q=Rattagan&type=o&order_by=score%20desc&stat_Published=on&court=cal

3 copies of the opinion:

  • https://www.courtlistener.com/opinion/10049082/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal
  • https://www.courtlistener.com/opinion/10050073/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal
  • https://www.courtlistener.com/opinion/10072000/rattagan-v-uber-technologies-inc/?q=Rattagan&type=o&order_by=score+desc&stat_Published=on&court=cal

Opinions 10049082 and 10050073 show 16 authorities, but opinion 10072000 shows only 7. Why would that be?

Tagging in @flooie for the citation part.

anseljh avatar Oct 10 '24 03:10 anseljh

@anseljh I had a chance to look into this, but I don’t have a definitive explanation for why one citation has more references than the other. It does seem odd at first glance. What I’ve noticed is that many of the missing supras aren’t showing up, which might explain some of the differences.

It doesn’t seem to be a parsing issue, though; it looks more like a search-related problem. Nearly all (though not all) of the missing authorities are actually marked as citation no-link in the source code. This suggests the issue lies with search and citation discovery, not parsing. That said, there are still quite a lot of missing no-link citations.

I’m wondering if we should consider highlighting found citations or making them visually distinct in some way—perhaps by using an underline or another marker?

flooie avatar Oct 10 '24 16:10 flooie

We can re-use code from opinion versioning to delete same hash duplicates

  • without creating stale links, via ClusterRedirection
  • merging relevant metadata fields, like "blocked" status

This queryset has 2278 elements; each may have from a few to a hundred duplicates. I got the queryset counts, and we would delete 45524-2278 = 43246 exact hash duplicates

In [18]: sum([i['number_of_rows'] for i in qs])
Out[18]: 45524

from cl.search.models import ClusterRedirection, SOURCES, Opinion
from django.db.models import Count, Q
from cl.scrapers.management.commands.merge_opinion_versions import comparable_dockets, merge_metadata


# Group scraped opinions by hash
# Keep the groups with a single hash, and more than 1 row
# these are same-hash duplicates
qs = (
    Opinion.objects
    .filter(cluster__source=SOURCES.COURT_WEBSITE)
    .exclude(Q(download_url="") | Q(download_url__isnull=True) | Q(sha1=""))
    .values("sha1")
    .annotate(
        number_of_rows=Count("sha1"),
        # compute the number of  distinct hashes to prevent colliding with
        # actual duplicates, which are not versions
        number_of_hashes=Count("sha1", distinct=True),
    )
    .order_by()
    .filter(number_of_rows__gte=2, number_of_hashes=1)
)


# for each group, we will keep a single opinion; let's prefer the latest
deleted = 0
for group in qs:
    print("Processing group ", group)
    op_to_keep, *to_delete = (
        Opinion.objects.filter(sha1=group['sha1'])
        .order_by("-date_created")
        .select_related('cluster', 'cluster__docket')
    )

    # reset the flags for every group, so a group where all candidates are
    # skipped does not fail or reuse values from the previous group
    updated_opinion = False
    updated_cluster = False

    for op_to_delete in to_delete:
        # check that they have the same docket
        if not comparable_dockets(op_to_keep.cluster.docket, op_to_delete.cluster.docket):
            print("Not the same docket", op_to_keep.cluster.docket.id, op_to_delete.cluster.docket.id)
            continue

        # merge all metadata; remember if any merge changed the kept objects
        updated_opinion = merge_metadata(op_to_keep, op_to_delete) or updated_opinion
        updated_cluster = merge_metadata(op_to_keep.cluster, op_to_delete.cluster) or updated_cluster

        # delete opinion
        cluster_to_delete = op_to_delete.cluster
        op_to_delete.delete()

        # delete cluster, leaving a redirection so old links keep working
        ClusterRedirection.create_from_clusters(
            op_to_keep.cluster, cluster_to_delete, ClusterRedirection.DUPLICATE
        )
        cluster_to_delete.delete()
        deleted += 1

    if updated_opinion:
        print("updating opinion %s" % op_to_keep.id)
        op_to_keep.save()
    if updated_cluster:
        print("updating cluster %s" % op_to_keep.cluster.id)
        op_to_keep.cluster.save()
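
The "rows minus groups" arithmetic quoted above can be checked on a toy sample. This is only an illustrative sketch in plain Python: `duplicate_stats` is a hypothetical helper that mirrors the queryset's grouping (non-empty hashes, two or more rows per hash), not code from the repository.

```python
from collections import Counter


def duplicate_stats(hashes):
    """Count same-hash groups, their rows, and how many rows a dedupe would delete."""
    counts = Counter(h for h in hashes if h)  # ignore empty hashes
    groups = {h: n for h, n in counts.items() if n >= 2}
    rows = sum(groups.values())
    # one row is kept per group, so deletions = rows - groups
    return {"groups": len(groups), "rows": rows, "to_delete": rows - len(groups)}


sample = ["a", "a", "a", "b", "b", "c", ""]
stats = duplicate_stats(sample)
print(stats)

# Same arithmetic with the counts reported in the comment.
reported_to_delete = 45524 - 2278
print(reported_to_delete)
```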

grossir avatar Aug 07 '25 21:08 grossir

I think the code should work well. I also believe that keeping the newest opinion is the best option; with the ClusterRedirection model we can already delete duplicates safely, without producing 404 error pages.

Wouldn't it be a good idea to put this code in a management command, so we can run it again in case new duplicates appear?

quevon24 avatar Aug 09 '25 01:08 quevon24

Ran

./manage.py delete_duplicates same_hash --verbosity 3

Output

{'same cluster': 47, 'same docket': 20101, 'deleted opinion': 40445, 'deleted cluster': 40398, 'deleted docket': 20344, 'not comparable docket': 2794, 'merging error': 16})

We are down to 67470 same hash duplicates across all sources, spread over 54574 sha1 groups; we deleted most of the 43246 scraper duplicates described above

courtlistener=> select sum(count_by_sha) - count(*) as total_duplicates, count(*) as sha1_groups_count from (select sha1, count(*) count_by_sha from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) a;
 total_duplicates | sha1_groups_count 
------------------+-------------------
            67470 |             54574

The affected opinions have the following distribution by creation year:

courtlistener=> select date_part('year', date_created), count(*) from search_opinion where sha1 in  (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) group by date_part('year', date_created) order by 2 desc;
 date_part | count  
-----------+--------
      2016 | 115545
      2024 |   3618
      2013 |   2763
      2023 |     87
      2025 |     14
      2020 |      8
      2017 |      4
      2010 |      3
      2014 |      2
(9 rows)

As for sources, most of these come from non-scraper sources (source <> 'C'), which we ignored by design and should target in a second round

courtlistener=> select source, count(*) from search_opinioncluster a inner join (select cluster_id from search_opinion where sha1 in (select sha1 count_by_sha from search_opinion where sha1 <> '' group by sha1 having count(*) > 1)) b on b.cluster_id = a.id group by source order by 2 desc; 
 source | count 
--------+-------
 ZU     | 99081
 Z      | 16261
 C      |  2961
 L      |  2052
 G      |   898
 LU     |   699
 CU     |    58
 D      |    14
 ZLU    |    11
 RU     |     4
 ZL     |     3
 R      |     1
 LRU    |     1
(13 rows)

grossir avatar Aug 14 '25 16:08 grossir

Looking at the Columbia archive duplicates (source = 'Z'), most of them come in pairs, created a few milliseconds apart.

Maybe an error in the import code?

They seem pretty safe to merge, following the same logic as when we merged the scraper-only source

  • abort the merge if there is any metadata difference (say, OpinionCluster.date_filed is different between the 2 clusters)
  • merge metadata if one object has it and the other does not (say one cluster has a syllabus, the other does not)
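
The two rules above can be sketched on plain dictionaries. This is a simplified illustration, not the repository's `merge_metadata`: `merge_fields` and `MergeConflict` are hypothetical names, and treating `None` and the empty string as "missing" is an assumption.

```python
class MergeConflict(Exception):
    """Raised when both objects carry different values for the same field."""


def _is_missing(value) -> bool:
    return value is None or value == ""


def merge_fields(keep: dict, drop: dict, fields) -> bool:
    """Copy values from `drop` into `keep`; return True if `keep` changed."""
    # First pass: abort before touching anything if any field differs.
    for field in fields:
        kept, other = keep.get(field), drop.get(field)
        if not _is_missing(kept) and not _is_missing(other) and kept != other:
            raise MergeConflict(field)

    # Second pass: fill in values only one side has.
    changed = False
    for field in fields:
        if _is_missing(keep.get(field)) and not _is_missing(drop.get(field)):
            keep[field] = drop[field]
            changed = True
    return changed


keep = {"date_filed": "2016-07-06", "syllabus": ""}
drop = {"date_filed": "2016-07-06", "syllabus": "Summary text"}
changed = merge_fields(keep, drop, ["date_filed", "syllabus"])
print(changed, keep)
```

Checking for conflicts in a separate pass means an aborted merge leaves the kept object untouched.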

So, I think we should only relax the source condition in the existing code

https://github.com/freelawproject/courtlistener/blob/e51752a5a4f8b45ba21a831838eb72a0a2db489c/cl/scrapers/management/commands/delete_duplicates.py#L118

probably by turning it into a command argument, so we can pass the combination of sources we want to merge
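
As a rough sketch of that argument, here is a standalone `argparse` parser shaped like the `delete_duplicates same_hash --cluster-sources ...` invocations used later in this thread. It is not the actual command: the real one is a Django management command, and the default of `C` (scraper-only) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="delete_duplicates")
    parser.add_argument("method", choices=["same_hash"])
    parser.add_argument(
        "--cluster-sources",
        nargs="+",
        default=["C"],  # assumed default: scraper-only clusters
        help="Cluster sources whose same-hash duplicates may be merged",
    )
    return parser


args = build_parser().parse_args(["same_hash", "--cluster-sources", "LU", "L"])
defaults = build_parser().parse_args(["same_hash"])
print(args.method, args.cluster_sources, defaults.cluster_sources)
```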

 select a.date_created, case_name, a.id opinion_id, source,  sha1, download_url from search_opinion a inner join search_opinioncluster b ON a.cluster_id = b.id where sha1 in  (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) and source = 'Z' limit 200;

         date_created          |                                  case_name                                   | opinion_id | source |                   sha1                   | download_url 
-------------------------------+------------------------------------------------------------------------------+------------+--------+------------------------------------------+--------------
 2016-07-06 05:56:55.498859+00 | Jones v. New Hanover Cty. Schools                                            |    3638000 | Z      | 00037b7d3a16851737ddea0669f39ddbfa4c0309 | 
 2016-07-06 05:56:55.491142+00 | Jones v. New Hanover Cty. Schools                                            |    3637999 | Z      | 00037b7d3a16851737ddea0669f39ddbfa4c0309 | 
 2016-07-06 06:56:12.70296+00  | State v. Abdul-Mumin, Unpublished Decision (2-10-2005)                       |    3727189 | Z      | 000f490e57e9be726e46f64584a7049010bc24cf | 
 2016-07-06 06:56:12.711009+00 | State v. Abdul-Mumin, Unpublished Decision (2-10-2005)                       |    3727190 | Z      | 000f490e57e9be726e46f64584a7049010bc24cf | 
 2016-07-06 06:41:26.883502+00 | Applegate v. Applegatae, 1724 (12-28-2007)                                   |    3703907 | Z      | 00127e1d4f79d7cf84bd306906dfc858fa95d135 | 
 2016-07-06 06:41:26.890619+00 | Applegate v. Applegatae, 1724 (12-28-2007)                                   |    3703908 | Z      | 00127e1d4f79d7cf84bd306906dfc858fa95d135 | 
 2016-07-06 06:42:38.898062+00 | State v. Bundy, Unpublished Decision (6-24-2005)                             |    3705897 | Z      | 001a13523f6e69d2d04f748abbc52b57dcf93559 | 
 2016-07-06 06:42:38.907802+00 | State v. Bundy, Unpublished Decision (6-24-2005)                             |    3705898 | Z      | 001a13523f6e69d2d04f748abbc52b57dcf93559 | 
 2016-07-05 23:35:10.141819+00 | Craft v. . Merrill and Another                                               |    3584848 | Z      | 00290522a7ae848b5c0604fee732bf3a83d65792 | 
 2016-07-05 23:35:10.148935+00 | Craft v. . Merrill and Another                                               |    3584849 | Z      | 00290522a7ae848b5c0604fee732bf3a83d65792 | 
 2016-07-06 06:47:05.136316+00 | State v. Rector, Unpublished Decision (10-1-2003)                            |    3712984 | Z      | 004230367c1875a52ab9b45d973446ceff465530 | 
 2016-07-06 06:47:05.142522+00 | State v. Rector, Unpublished Decision (10-1-2003)                            |    3712985 | Z      | 004230367c1875a52ab9b45d973446ceff465530 | 
 2016-07-06 06:48:53.839032+00 | Kennedy v. Conrad, Unpublished Decision (3-14-2002)                          |    3715912 | Z      | 0045347a304869b73f81e4e5b688249f7bd1f223 | 
 2016-07-06 06:48:53.832939+00 | Kennedy v. Conrad, Unpublished Decision (3-14-2002)                          |    3715911 | Z      | 0045347a304869b73f81e4e5b688249f7bd1f223 | 
 2016-07-05 20:45:51.160137+00 | Morrissey v. Police Jury of Vermilion Parish                                 |    3474596 | Z      | 00475822dd8cec5d88b0879135eb945bb4c90278 | 
 2016-07-05 20:45:51.166393+00 | Morrissey v. Police Jury of Vermilion Parish                                 |    3474597 | Z      | 00475822dd8cec5d88b0879135eb945bb4c90278 | 
 2016-07-06 06:28:20.368436+00 | State v. Thomas, Unpublished Decision (7-13-2000)                            |    3682894 | Z      | 0050ab36f8dc0fc82b85eedd7225d64d11150f74 | 
 2016-07-06 06:28:20.361835+00 | State v. Thomas, Unpublished Decision (7-13-2000)                            |    3682893 | Z      | 0050ab36f8dc0fc82b85eedd7225d64d11150f74 | 
 2016-07-06 09:56:54.313842+00 | Childers v. State                                                            |    3931690 | Z      | 005776b0fd0f0da403daffca925f53d3da1a1dab | 
 2016-07-06 09:56:54.325421+00 | Childers v. State                                                            |    3931691 | Z      | 005776b0fd0f0da403daffca925f53d3da1a1dab | 
 2016-07-06 06:41:30.488012+00 | In Re Carson, 2007ca00070 (10-22-2007)                                       |    3704007 | Z      | 005e1a6fc0e0985c860322b259878d7dde8538ae | 
 2016-07-06 06:41:30.479926+00 | In Re Carson, 2007ca00070 (10-22-2007)                                       |    3704006 | Z      | 005e1a6fc0e0985c860322b259878d7dde8538ae | 
 2016-07-06 00:06:01.575383+00 | The People v. . Clements                                                     |    3625269 | Z      | 006d6ffc4e712df662e9cb509cff6732c6c1150c | 
 2016-07-06 00:06:01.581533+00 | The People v. . Clements                                                     |    3625270 | Z      | 006d6ffc4e712df662e9cb509cff6732c6c1150c | 
 2016-07-06 06:31:57.908195+00 | Morrison v. Petro Evaluation, Unpublished Decision (10-21-2005)              |    3688579 | Z      | 006fc6ddff3c50b1fcdd503e1bee4a7cb6ff0757 | 
 2016-07-06 06:31:57.916008+00 | Morrison v. Petro Evaluation, Unpublished Decision (10-21-2005)              |    3688580 | Z      | 006fc6ddff3c50b1fcdd503e1bee4a7cb6ff0757 | 
 2016-07-06 06:38:26.489139+00 | State v. Ball, 07ap-818 (6-3-2008)                                           |    3699055 | Z      | 0073a751c30952e69cd70993672017a59d17a86e | 
 2016-07-06 06:38:26.495214+00 | State v. Ball, 07ap-818 (6-3-2008)                                           |    3699056 | Z      | 0073a751c30952e69cd70993672017a59d17a86e | 
 2016-07-06 07:11:41.084508+00 | Wilson v. Brush Wellman, Inc., Unpublished Decision (10-17-2002)             |    3751658 | Z      | 0083edb9c4424e2c9481c88cc6c11e51532991c8 | 
 2016-07-06 07:11:41.091379+00 | Wilson v. Brush Wellman, Inc., Unpublished Decision (10-17-2002)             |    3751659 | Z      | 0083edb9c4424e2c9481c88cc6c11e51532991c8 | 
 2016-07-06 06:57:46.075215+00 | State v. Smith, Unpublished Decision (10-20-1997)                            |    3729684 | Z      | 008da6d094d654a64206e75df3f6dd016a2670ad | 
 2016-07-06 06:57:46.068159+00 | State v. Smith, Unpublished Decision (10-20-1997)                            |    3729683 | Z      | 008da6d094d654a64206e75df3f6dd016a2670ad | 
 2016-07-06 05:58:03.379422+00 | Walker v. N.C. D.O.T.                                                        |    3640032 | Z      | 00ab609e3beb4650a2a95f9c17dded7f87698d1a | 
 2016-07-06 05:58:03.386411+00 | Walker v. N.C. D.O.T.                                                        |    3640033 | Z      | 00ab609e3beb4650a2a95f9c17dded7f87698d1a | 
 2016-07-06 07:27:01.533687+00 | State v. Rakaf, 2008-P-0057 (12-31-2008)                                     |    3776587 | Z      | 00ba828baebef578a56bf18047c4ce432ec17968 | 
 2016-07-06 07:27:01.526016+00 | State v. Rakaf, 2008-P-0057 (12-31-2008)                                     |    3776586 | Z      | 00ba828baebef578a56bf18047c4ce432ec17968 | 
 2016-07-06 05:58:30.740587+00 | Johnson v. City of Winston-Salem                                             |    3640842 | Z      | 00c25a6702cd8819eb0781fa252b817d735d3015 | 
 2016-07-06 05:58:30.748391+00 | Johnson v. City of Winston-Salem                                             |    3640843 | Z      | 00c25a6702cd8819eb0781fa252b817d735d3015 | 
 2016-07-06 07:02:40.528436+00 | State v. Blanchard, 90935 (3-26-2009)                                        |    3737489 | Z      | 00d11d22924e99370aa7afd6000610b45e6fa1ec | 
 2016-07-06 07:02:40.535805+00 | State v. Blanchard, 90935 (3-26-2009)                                        |    3737490 | Z      | 00d11d22924e99370aa7afd6000610b45e6fa1ec | 
 2016-07-06 07:21:53.563336+00 | State Ex Rel. Petro v. Marshall, Unpublished Decision (10-10-2006)           |    3768302 | Z      | 00d29d3f089820c4024ad40fdd47aa4f2f22fa4c | 
 2016-07-06 07:21:53.572104+00 | State Ex Rel. Petro v. Marshall, Unpublished Decision (10-10-2006)           |    3768303 | Z      | 00d29d3f089820c4024ad40fdd47aa4f2f22fa4c | 
 2016-07-05 22:05:36.388811+00 | In Re Estate of Taylor                                                       |    3498141 | Z      | 00d2a6cced199d443f0cbc18f986f1377a77c233 | 
 2016-07-05 22:05:36.395971+00 | In Re Estate of Taylor                                                       |    3498142 | Z      | 00d2a6cced199d443f0cbc18f986f1377a77c233 | 
 2016-07-06 07:10:26.032799+00 | Copelco Capital v. St. Mark's Church, Unpublished Decision (2-1-2001)        |    3749824 | Z      | 00d6d8830307918961ba155a7372b82eb4f1c910 | 
 2016-07-06 07:10:26.027705+00 | Copelco Capital v. St. Mark's Church, Unpublished Decision (2-1-2001)        |    3749823 | Z      | 00d6d8830307918961ba155a7372b82eb4f1c910 | 
 2016-07-05 20:43:11.132823+00 | Oubre v. Mutual Life Ins. Co. of New York                                    |    3473002 | Z      | 00de1b8d5a763dde41dbf8b93baefda4f0f5de7a | 
 2016-07-05 20:43:11.125891+00 | Oubre v. Mutual Life Ins. Co. of New York                                    |    3473001 | Z      | 00de1b8d5a763dde41dbf8b93baefda4f0f5de7a | 
 2016-07-06 06:41:56.33185+00  | State v. Tolar, Unpublished Decision (10-31-2003)                            |    3704710 | Z      | 00df9819a45162c427cc4d270c95cdac4d92cf82 | 

grossir avatar Aug 14 '25 17:08 grossir

Checking the Lawbox duplicates source = 'L'

  • not timestamp clustered
  • probably caused by running the import twice, without checking for duplicates

courtlistener=> select a.date_created, case_name, a.id opinion_id, source,  sha1, download_url from search_opinion a inner join search_opinioncluster b ON a.cluster_id = b.id where sha1 in  (select sha1 from search_opinion where sha1 <> '' group by sha1 having count(*) > 1) and source = 'L' limit 200;
         date_created          |                        case_name                        | opinion_id | source |                   sha1                   | download_url 
-------------------------------+---------------------------------------------------------+------------+--------+------------------------------------------+--------------
 2013-10-30 09:14:16.05283+00  | Olivar v. Nooth                                         |    2358485 | L      | 0027244bdf84359cf019951d3d7d99a25edf0c5f | 
 2013-11-01 21:00:35.650582+00 | Olivar v. Nooth                                         |    2631281 | L      | 0027244bdf84359cf019951d3d7d99a25edf0c5f | 
 2013-11-01 20:45:33.819633+00 | Smith v. Executive Custom Homes, Inc.                   |    2626035 | L      | 002be05587129cf10808c5e4b31a26a5d3a03aeb | 
 2013-10-30 09:16:18.825383+00 | Mills v. Western Washington University                  |    2369882 | L      | 004fef16166a853d4be13415d7606c0f87b3a3ca | 
 2013-11-01 20:52:20.262574+00 | Mills v. WESTERN WASHINGTON UNIVERSITY                  |    2628541 | L      | 004fef16166a853d4be13415d7606c0f87b3a3ca | 
 2013-11-01 20:38:58.289027+00 | BE & K. CONST. v. Abbott                                |    2621238 | L      | 005a88bdd566cb38aeef2dc6c3d59d021dc6c9c1 | 
 2013-11-01 21:00:58.154575+00 | State v. ALVERTO                                        |    2631472 | L      | 0083d993f083c6e6816ed9c6b048406a86b4c9f3 | 
 2013-10-30 09:16:12.859072+00 | State v. ALVERTO                                        |    2369322 | L      | 0083d993f083c6e6816ed9c6b048406a86b4c9f3 | 
 2013-11-01 20:44:37.689611+00 | Citizens for Resp. Growth v. Rci Dev't Ptr.             |    2625445 | L      | 00fd98ea6d0602a33831a17b499d7746a1578486 | 
 2013-11-01 20:53:10.018283+00 | Ahlschlager v. LAWTON SCHOOL DIST.                      |    2629040 | L      | 017a4012e489a14eaf559f3b2c371a598c239da7 | 
 2013-11-01 20:44:52.448814+00 | Hicks v. Londre                                         |    2625502 | L      | 018f120de6e1020bd1be54656634a5f21389ca68 | 
 2013-11-01 20:43:37.230223+00 | STATE EX REL. OKLAHOMA BAR ASS'N v. Edwards             |    2624756 | L      | 01c886e38f82351aac9db0ffedd0b64484663fd8 | 
 2013-11-01 20:53:19.004749+00 | State v. Vargas-Torres                                  |    2629065 | L      | 01cbbcaee92c2ed7c1981031bda9f86f1b597922 | 
 2013-10-30 09:17:06.653186+00 | Sather v. City of Spokane                               |    2374802 | L      | 02462bb2df50792f242fc0ad86779557092707d9 | 
 2013-11-01 21:01:05.517301+00 | Sather v. City of Spokane                               |    2631521 | L      | 02462bb2df50792f242fc0ad86779557092707d9 | 
 2013-11-01 20:44:17.257844+00 | Andrus v. Andrus                                        |    2625163 | L      | 0252722b1fb32178ec71584d89826a4ac29eec4e | 
 2013-10-30 09:18:04.081236+00 | St. Joseph Gen. Hosp. v. Dept. of Revenue               |    2380520 | L      | 027956fa029df58fcecf992258e8f10c9825dc01 | 
 2013-11-01 20:53:33.723249+00 | St. Joseph Gen. Hosp. v. Dept. of Revenue               |    2629148 | L      | 027956fa029df58fcecf992258e8f10c9825dc01 | 
 2013-11-01 21:31:35.696104+00 | In Re Doe                                               |    2640443 | L      | 0281fcef17c96341df58ca617517ed495c2c8ade | 
 2013-11-01 20:52:24.588853+00 | State v. Hernandez-Lopez                                |    2628565 | L      | 0290887f633b8c9c5ce437c18b7f7f68ac0ba2a1 | 
 2013-10-30 09:16:35.107471+00 | State v. Hernandez-Lopez                                |    2371613 | L      | 0290887f633b8c9c5ce437c18b7f7f68ac0ba2a1 | 
 2013-11-01 20:44:48.488829+00 | Jc v. Dungarvin Colorado, LLC                           |    2625467 | L      | 02a3ca1851cce5c7bfb71f1a150bd6f62626d88b | 
 2013-11-01 20:43:28.675387+00 | State v. Hager                                          |    2624708 | L      | 0308170b66bd55c6428b24d47285b662c4db71fe | 
 2013-10-30 09:15:44.860015+00 | State v. Hager                                          |    2366473 | L      | 0308170b66bd55c6428b24d47285b662c4db71fe | 
 2013-11-01 21:17:58.644906+00 | Horton v. Mitchell                                      |    2633377 | L      | 03141fc6fa7fbace237b631d23c52152016e1d97 | 
 2013-11-01 20:53:26.900092+00 | Rivera-Longoria v. Slayton                              |    2629095 | L      | 031ee323ddd9586044da690331a40fecc0fa5845 | 
 2013-10-30 10:56:44.052404+00 | In Re Personal Restraint Petition of Silas              |    2589978 | L      | 037c8b731717ffafc2dfd758abb0a4d2b9f8a183 | 
 2013-11-01 20:42:42.081011+00 | In Re Personal Restraint Petition of Silas              |    2624113 | L      | 037c8b731717ffafc2dfd758abb0a4d2b9f8a183 | 
 2013-10-30 11:00:56.891017+00 | Biggers v. City of Bainbridge Island                    |    2595071 | L      | 03840ddcba7bac31346068afbbd0a6e1f5433e7f | 
 2013-11-01 20:44:16.115047+00 | Biggers v. City of Bainbridge Island                    |    2625138 | L      | 03840ddcba7bac31346068afbbd0a6e1f5433e7f | 
 2013-10-30 09:18:29.525797+00 | State v. Vars                                           |    2382308 | L      | 03c98e7d91facfe15aa7a050037f3af78d453831 | 
 2013-11-01 21:01:52.675138+00 | State v. Vars                                           |    2631946 | L      | 03c98e7d91facfe15aa7a050037f3af78d453831 | 
 2013-11-01 20:42:43.061601+00 | State v. Anderson                                       |    2624122 | L      | 041d6e8588a3dd746086aee6ac00d9503bfbed65 | 
 2013-11-01 20:52:08.167247+00 | West v. Reed                                            |    2628391 | L      | 043989e79ac383c7cddedcf19fc4fd8873662e13 | 
 2013-10-30 09:14:05.407474+00 | West v. Reed                                            |    2357687 | L      | 043989e79ac383c7cddedcf19fc4fd8873662e13 | 
 2013-11-01 21:02:14.450992+00 | People v. Lynch                                         |    2632100 | L      | 0447f692fa504488982ba8572057654628564015 | 
 2013-11-01 20:44:29.958702+00 | Normandeau v. HANSON EQUIPMENT, INC.                    |    2625340 | L      | 045672790a8a933dd17d4d9b8e1e607ac336b718 | 
 2013-11-01 20:43:42.660102+00 | State v. Taylor                                         |    2624805 | L      | 04834554be2109e0a328c3c85670749f0422c911 | 
 2013-10-30 09:17:44.551268+00 | State v. Taylor                                         |    2378761 | L      | 04834554be2109e0a328c3c85670749f0422c911 | 
 2013-11-01 21:01:02.355339+00 | Smith v. Holbrook                                       |    2631506 | L      | 049f82e1c9b3a7de35a9206f716dbbe0ce6b5e0d | 
 2013-10-30 09:16:45.347923+00 | Smith v. Holbrook                                       |    2372628 | L      | 049f82e1c9b3a7de35a9206f716dbbe0ce6b5e0d | 
 2013-10-30 08:35:57.886318+00 | Jordan v. BELLEQUE                                      |    2205065 | L      | 04e1515f3074d748d1297af9f84d991bda78e92a | 
 2013-11-01 21:01:51.080407+00 | Jordan v. BELLEQUE                                      |    2631927 | L      | 04e1515f3074d748d1297af9f84d991bda78e92a | 
 2013-11-01 20:53:13.907113+00 | State v. Sievers                                        |    2629058 | L      | 054d8e3ad3dae94cafaa79d564db73996d1905fe | 
 2013-10-30 09:15:31.002335+00 | State v. Sievers                                        |    2365199 | L      | 054d8e3ad3dae94cafaa79d564db73996d1905fe | 
 2013-10-30 09:16:07.807751+00 | State v. Spradlin                                       |    2368795 | L      | 0550b671f479fa8eaa595563c57fa66f2c76b57f | 
 2013-11-01 20:43:32.352155+00 | State v. Spradlin                                       |    2624734 | L      | 0550b671f479fa8eaa595563c57fa66f2c76b57f | 
 2013-10-30 09:15:16.087202+00 | Charlton v. TOYS" R" US-DELAWARE, INC.                  |    2363612 | L      | 057f15792759a91a211174f3e54416747fb66586 | 
 2013-11-01 20:52:13.705656+00 | Charlton v. TOYS" R" US-DELAWARE, INC.                  |    2628453 | L      | 057f15792759a91a211174f3e54416747fb66586 | 
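
To make the contrast between the two import patterns concrete, this sketch computes the creation gap for one Columbia pair and one Lawbox pair, using timestamps copied from the rows in this thread. `parse_pg_timestamp` and `gap_seconds` are hypothetical helpers, and the parser assumes the `+00` offset that psql printed here.

```python
from datetime import datetime


def parse_pg_timestamp(value: str) -> datetime:
    """Parse a psql timestamp such as '2016-07-06 05:56:55.498859+00' (UTC assumed)."""
    naive = value.strip().split("+")[0]
    return datetime.strptime(naive, "%Y-%m-%d %H:%M:%S.%f")


def gap_seconds(first: str, second: str) -> float:
    """Absolute difference between two timestamps, in seconds."""
    delta = parse_pg_timestamp(first) - parse_pg_timestamp(second)
    return abs(delta.total_seconds())


# Columbia (source Z): opinions 3638000 and 3637999
columbia_gap = gap_seconds(
    "2016-07-06 05:56:55.498859+00", "2016-07-06 05:56:55.491142+00"
)

# Lawbox (source L): opinions 2358485 and 2631281
lawbox_gap = gap_seconds(
    "2013-10-30 09:14:16.05283+00", "2013-11-01 21:00:35.650582+00"
)

print(columbia_gap, lawbox_gap)
```

The Columbia pair is milliseconds apart, which fits a single import run writing each row twice. The Lawbox pair is more than two days apart, which fits a second run of the import.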

grossir avatar Aug 14 '25 18:08 grossir

Only 776 hashes have more than 1 source:

select 
    sha1, count(distinct(source))
from search_opinioncluster a 
inner join (
    select cluster_id, sha1
    from search_opinion 
    where sha1 in (
        select sha1
        from search_opinion 
        where sha1 <> '' 
        group by sha1 
        having count(*) > 1
    )
) b on b.cluster_id = a.id 
group by sha1
having count(distinct(source)) > 1
order by 2 desc
;

same-hash-duplicates-more-than-1-source.txt

grossir avatar Aug 19 '25 02:08 grossir

Source L

Ran the command for source L (lawbox) and got an error due to a bug in the merging code when trying to migrate related objects from one cluster to the other

./manage.py delete_duplicates same_hash --cluster-sources L --verbosity 3

INFO Groups to process 670 for sources ['L']
...
INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7efd345f2b60>, {'deleted opinion': 39, 'deleted cluster': 39, 'deleted docket': 39})

delete_duplicates_source_L.txt

Update

After fixing the merging bug:

INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7fa453362d40>, {'deleted opinion': 623, 'deleted cluster': 623, 'deleted docket': 623, 'merging error': 8})

delete-duplicates-source-l.txt

Source LU

./manage.py delete_duplicates same_hash --cluster-sources LU L --verbosity 3

INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7ff6329a2d40>, {'deleted opinion': 569, 'deleted cluster': 569, 'deleted docket': 569, 'not comparable docket': 121, 'merging error': 17})

delete-duplicates-source-lu-l.txt


Source Z

Ran the command for source Z and got a bunch of unexpected differences preventing the merge

A lot of Sentry events (filter to August 20, 2025 around 9:19 PM UTC)

Many of these seem to come from an error in the hash assignment. See this example of 2 sub-opinions that have exactly the same Opinion.sha1 but different content. They should not share a hash at all:

  • https://www.courtlistener.com/api/rest/v4/opinions/3695705/
  • https://www.courtlistener.com/api/rest/v4/opinions/3695706/
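
A minimal sketch of how such cases could be flagged in bulk: group opinions by their stored hash and report every hash whose opinions do not all carry the same text. The row shape `(opinion_id, sha1, text)`, the whitespace normalization and the sample rows are assumptions for illustration; this is not the logic of the `delete_duplicates` command.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    # Assumption: whitespace-only differences should not count as different content.
    return " ".join(text.split())


def hashes_with_conflicting_content(rows):
    """Return stored hashes that are shared by opinions with different text.

    rows: iterable of (opinion_id, sha1, text) tuples (hypothetical shape).
    """
    contents_by_hash = defaultdict(set)
    for _opinion_id, sha1, text in rows:
        contents_by_hash[sha1].add(normalize(text))
    return sorted(sha1 for sha1, texts in contents_by_hash.items() if len(texts) > 1)


# Invented rows: ids, hashes and texts are placeholders.
rows = [
    (1, "abc123", "Majority opinion text."),
    (2, "abc123", "Dissenting opinion text."),  # same stored hash, different content
    (3, "def456", "Per curiam."),
    (4, "def456", "Per  curiam. "),             # same content modulo whitespace
]

conflicts = hashes_with_conflicting_content(rows)
print(conflicts)
```

Any hash this returns cannot be a true content hash for every opinion that stores it, so those groups should be excluded from merging until the hashes are recomputed.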

Source G

Ran the command for source G; most groups were not merged because their docket numbers differ. This seems to happen because the same opinion exists once for each of the consolidated dockets. This needs further analysis.

See an example opinion

The clusters whose opinion has the same hash as the one above:

  • Reece, No. 1:19-CV-219;
  • Bamrick, No. 1:19-CV-225;
  • Driscoll, No. 1:19-CV-231;
  • Gates, No. 1:19-CV-221;
  • Slowey, No. 1:19-CV-216;
  • Webber, No. 1:19-CV-220;
  • Wyman, No. 1:19-CV-215.
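
A rough sketch of why nothing in this group merges: when the same-hash clusters are grouped by docket number, every cluster ends up alone in its own group. The docket numbers are the ones listed above; the grouping function and its normalization are hypothetical and do not reflect how `delete_duplicates` compares dockets.

```python
from collections import defaultdict

# Case name and docket number for each cluster sharing the opinion hash.
same_hash_clusters = [
    ("Reece", "1:19-CV-219"),
    ("Bamrick", "1:19-CV-225"),
    ("Driscoll", "1:19-CV-231"),
    ("Gates", "1:19-CV-221"),
    ("Slowey", "1:19-CV-216"),
    ("Webber", "1:19-CV-220"),
    ("Wyman", "1:19-CV-215"),
]


def group_by_docket_number(clusters):
    """Group case names by docket number (trimmed and upper-cased)."""
    groups = defaultdict(list)
    for case_name, docket_number in clusters:
        groups[docket_number.strip().upper()].append(case_name)
    return dict(groups)


groups = group_by_docket_number(same_hash_clusters)
# Only groups with more than one cluster on the same docket are merge candidates.
mergeable = {docket: names for docket, names in groups.items() if len(names) > 1}
print(len(groups), mergeable)
```

Seven clusters produce seven docket groups and no merge candidates, which is consistent with the high 'not comparable docket' count for this source.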

delete_duplicates_source_g.txt

./manage.py delete_duplicates same_hash --cluster-sources G --verbosity 3
INFO Groups to process 344 for sources ['G']

INFO defaultdict(<function Command.handle.<locals>.<lambda> at 0x7fa687eceb60>, {'not comparable docket': 505, 'same docket': 47, 'deleted opinion': 47, 'deleted cluster': 47, 'merging error': 2})

grossir avatar Aug 21 '25 14:08 grossir

Not sure why these have not been deleted yet


select
    court_id, download_url, count, nsha1
from (
    select
        download_url,
        max(cluster_id) as cluster_id,
        count(*) as count,
        count(distinct(sha1)) as nsha1
    from search_opinion
    where download_url <> ''
    group by download_url
    having count(*) > 50
) a
inner join search_opinioncluster soc on cluster_id = soc.id
inner join search_docket sd on sd.id = docket_id
order by 2 desc;

 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55043&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion       |   143 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55019&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion       |   142 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55013&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion       |   142 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=55009&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa07%5cOpinion       |   142 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42499&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    58 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42496&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    58 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42467&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    57 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42466&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    57 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42464&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    57 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=42462&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa14%5cOpinion       |    57 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24829&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24828&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24827&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24823&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24822&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24819&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24818&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24814&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texcrimapp     | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=24813&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccca%5cOpinion         |    93 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=14438&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa12%5cOpinion       |   107 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=14435&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa12%5cOpinion       |   107 |     1
 texapp         | https://search.txcourts.gov/RetrieveDocument.aspx?DocId=12447&Index=%5c%5c10%2e20%2e4%2e7%5cTamesIndexes%5ccoa10%5cOpinion       |   130 |     1
 ind            | https://public.courts.in.gov/Decisions/api/Document/Opinion?Id=CjY8lm-9eeN6IGLO9oee_CIsFS9mLqs4IDMF1VXNtvBXzkLz-mZimOUf5k2w7esT0 |   170 |     1
 ind            | https://public.courts.in.gov/Decisions/api/Document/Opinion?Id=46HRQ-DEpRSJkJne5gp7Splkf_ezdYB0et7R5x84rFkH3Psf0EG3oq5tnZngeO6v0 |   171 |     1

grossir avatar Sep 25 '25 00:09 grossir

Ran ./manage.py delete_duplicates same_hash --verbosity 3 after implementing this improvement. After this run, we are left with:

{'not comparable docket': 8403, 'same docket': 1254, 'deleted opinion': 2598, 'deleted cluster': 2598, 'merging error': 16, 'deleted docket': 1344}

delete-same-hash-duplicates-log.txt

  • for SCRAPER "C" source: hard edge cases that won't merge because their metadata differs. Some of these may be corrections by the courts (16 merging errors); others may be bugs in docket assignment on our part ("not comparable docket", though since comparisons are done in combinations, the real number is lower)

  • for other sources / mixed sources:

    • for HARVARD, we are still computing the hashes, but a lot of them will have repeated hashes due to their content being just short texts
    • for HARVARD mixed sources, we still need more work
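
To illustrate the note about comparisons being done in combinations: assuming every pair of clusters in a same-hash group is compared once, a group of n clusters yields n(n-1)/2 comparisons, so a single large group can add many 'not comparable docket' counts. The group below is invented, and the pairwise assumption is mine rather than something confirmed from the command's code.

```python
from itertools import combinations
from math import comb


def count_pairwise_comparisons(group):
    """Count the unordered pairs that a pairwise comparison would visit."""
    return sum(1 for _ in combinations(group, 2))


# Hypothetical same-hash group of 7 clusters, each on a different docket.
group = [f"cluster_{i}" for i in range(7)]
pairs = count_pairwise_comparisons(group)

# The closed form n * (n - 1) / 2 gives the same number.
print(pairs, comb(len(group), 2))
```

Under that assumption one group of 7 clusters accounts for 21 pair-level counts, so the 8403 figure overstates the number of distinct groups left to review.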

grossir avatar Oct 17 '25 20:10 grossir