Update Citation model's full span and regexes to account for ReferenceCitation overlaps

Open grossir opened this issue 1 year ago • 1 comments

With the introduction of ReferenceCitations, we noticed that they sometimes overlap with other citation models.

Given that a Reference may be a standalone name ("As seen in Roe, ...") or a name and pincite combination ("As seen in Roe at 223"), a reference extractor that does not take other citation models into account may incorrectly extract references that are actually part of fuller citations.

Currently, this is managed by eyecite.helpers.filter_citations, but we have been running into bugs caused by incorrect full span calculations or by incomplete extractors.
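The containment check that this kind of filtering depends on can be sketched as follows. This is a minimal illustration, not eyecite's filter_citations: the Cite dataclass and drop_contained_references are hypothetical names, and spans are assumed to be (start, end) character offsets of the full citation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cite:
    """Hypothetical stand-in for a citation object (not an eyecite class)."""
    kind: str         # e.g. "reference", "supra", "short"
    full_span: tuple  # (start, end) character offsets, end exclusive


def drop_contained_references(cites):
    """Drop 'reference' cites whose full span lies inside another cite's full span."""
    kept = []
    for c in cites:
        contained = c.kind == "reference" and any(
            o is not c
            and o.full_span[0] <= c.full_span[0]
            and c.full_span[1] <= o.full_span[1]
            for o in cites
        )
        if not contained:
            kept.append(c)
    return kept


text = "Twombly, supra, at 553-554"
reference = Cite("reference", (0, 7))       # "Twombly"
supra_narrow = Cite("supra", (9, 14))       # only the word "supra"
supra_full = Cite("supra", (0, len(text)))  # the whole citation

# With a too-narrow supra span the reference survives; with the full span it is dropped.
survivors_narrow = drop_contained_references([reference, supra_narrow])
survivors_full = drop_contained_references([reference, supra_full])
```

The point of the sketch is that the filter can only be as good as the full spans it is given: if the supra span stops at the word "supra", nothing tells the filter that "Twombly" belongs to it.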

overlap with supra

From Example 1

Overlap with a supra citation: Twombly, supra, at 553-554

A Reference would be found inside the Supra citation because of an incomplete full span calculation: https://github.com/freelawproject/eyecite/blob/32ee7566aa079d7285560bdf3e77557740a5fa63/eyecite/find.py#L313-L324
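As a rough illustration of what a complete supra span would need to cover, the sketch below uses a hypothetical regex (not the one in eyecite's find.py) that includes an optional single-word antecedent before "supra" and an optional pin cite after it. The pattern and the supra_full_span helper are assumptions made for this example only.

```python
import re

# Hypothetical pattern, not eyecite's: optional antecedent, "supra", optional pin cite.
SUPRA_FULL = re.compile(
    r"(?:(?P<antecedent>[A-Z][\w.'-]*),?\s+)?"
    r"supra"
    r"(?P<pin_cite>,?\s+(?:at\s+)?\d+(?:-\d+)?)?"
)


def supra_full_span(text):
    """Return (start, end) of the first supra citation including antecedent and pin cite."""
    m = SUPRA_FULL.search(text)
    return m.span() if m else None


text = "As noted in Twombly, supra, at 553-554, the complaint failed."
span = supra_full_span(text)
covered = text[span[0]:span[1]]
```

Here covered is the string "Twombly, supra, at 553-554", so a Reference for "Twombly" would fall entirely inside the supra span and could be filtered out. Multi-word antecedents such as "State v. Howard" are not handled by this simplified pattern.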

overlap with short case citation

From Example 1

Overlap with a ShortCaseCitation: Twombly, 550 U. S. (I think this has been solved recently.)

overlap with single-name and pincite full case citation

Example 2:

Nobelman at 332, 113 S.Ct. 2106 is actually a pincited case citation (?). Currently we would identify it as a Reference followed by a full citation, or maybe by a short case citation.

overlap with single name full case citation

From example 1

Not strictly related to References, but to parallel citations. This should probably be split into another issue, but I am noting it here so that it can be added as test cases that we know will fail.

Example

State v. Howard, supra 128-129, 539 A.2d 1203. is a single citation that lists all the parallels, but our system will recognize it as a SupraCitation followed by a CaseCitation

On the same example, something similar happens with an IdCitation and parallel citations

Feb 11 '25 20:02 grossir

I added a logger.error for unknown overlap types; this is surfacing some clues about new citation formats.
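A minimal sketch of that kind of logging, assuming a whitelist of overlap pairs that the filter already knows how to resolve. The KNOWN_OVERLAPS set, the logger name, and report_overlap are all hypothetical and do not reflect the actual eyecite code.

```python
import logging

logger = logging.getLogger("overlap_sketch")  # hypothetical logger name

# Hypothetical whitelist of (inner, outer) overlap pairs that are already handled.
KNOWN_OVERLAPS = {
    ("ReferenceCitation", "SupraCitation"),
    ("ReferenceCitation", "ShortCaseCitation"),
}


def report_overlap(inner_type, outer_type, text):
    """Return True if the overlap is known; otherwise log it and return False."""
    if (inner_type, outer_type) in KNOWN_OVERLAPS:
        return True
    logger.error("Unknown overlap: %s inside %s in %r", inner_type, outer_type, text)
    return False


known = report_overlap("ReferenceCitation", "SupraCitation", "Twombly, supra, at 553-554")
unknown = report_overlap("IdCitation", "FullCaseCitation", "See id. at 248, 106 S. Ct. 2505")
```

Logging the citation text together with the two types is what makes the error useful as a source of new test cases.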

New overlap type: IdCitation with FullCaseCitation, where the correct citation type would be a FullCaseIdCitation

From this opinion: ...are material. See id. at 248, 106 S. Ct. 2505. A dispute...

The key to reading this is FullCaseCitation.metadata.defendant. That field is only populated by helpers.add_defendant. In this case, it finds a stopword in the "See" token before the "id." and takes everything after it as the defendant. I think the whole string should be a single citation, but we don't support that with our current model.

[
FullCaseCitation('106 S.Ct. 2505', groups={'volume': '106', 'reporter': 'S.Ct.', 'page': '2505'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant='id. at 248', extra=None, antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
IdCitation('id.', metadata=IdCitation.Metadata(parenthetical=None, pin_cite='at 248')),
]
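The defendant value in this output can be reproduced with a deliberately simplified backward scan. This is an illustration of the mechanism described above, not the actual helpers.add_defendant code: the STOP_WORDS set, the token list, and guess_defendant are assumptions made for the example.

```python
STOP_WORDS = {"see", "citing", "cf."}  # assumed subset, for illustration only


def guess_defendant(tokens, cite_index):
    """Walk backwards from the citation token to a stopword; return the words in between."""
    for i in range(cite_index - 1, -1, -1):
        if tokens[i].lower() in STOP_WORDS:
            # Everything between the stopword and the citation is treated as the name.
            return " ".join(tokens[i + 1:cite_index]).rstrip(",") or None
    return None


# The last token stands for the already-matched citation "106 S. Ct. 2505".
tokens = ["are", "material.", "See", "id.", "at", "248,", "106 S. Ct. 2505"]
defendant = guess_defendant(tokens, cite_index=6)
```

Because the scan stops at "See", the id citation and its pin cite are swallowed as the defendant name, which is how the FullCaseCitation ends up overlapping the IdCitation.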

Another overlap in the same opinion: a FullCaseCitation with a ShortCaseCitation. In this case, the two are actually a parallel citation. Again, the behavior comes from helpers.add_defendant.

was pretextual. See McDonnell Douglas, 411 U.S. at 804, 93 S. Ct. 1817

FullCaseCitation('93 S.Ct. 1817', groups={'volume': '93', 'reporter': 'S.Ct.', 'page': '1817'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant='McDonnell Douglas, 411 U.S. at 804', extra=None, antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None))

ShortCaseCitation('411 U.S. at 804', groups={'volume': '411', 'reporter': 'U.S.', 'page': '804'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='804', year=None, court='scotus', antecedent_guess='Douglas'))

Feb 14 '25 21:02 grossir