
Custom or Context-Sensitive Reporters

Open nmccamish opened this issue 4 months ago • 1 comments

In my home state, the Commonwealth of Kentucky, appellate (and trial) courts do not cite statutes in the Michie's or Baldwin's form, like, say, Ky. Rev. Stat. Ann. § 021.48. Almost always, they cite them as KRS, perhaps with a footnote explaining that it is short for Kentucky Revised Statutes, and without the section sign. Short of, say, forking reporters-db, what would be a quick way to tell eyecite to parse such citations? I have several other, for lack of a better term, "custom" or "context-sensitive" reporters, such as Kentucky Open Records Decisions, where the syntax is typically <YY>-ORD-<NNN>.

nmccamish avatar Nov 06 '25 20:11 nmccamish

This is not documented behavior, but I think you should be able to inject your custom regexes into the tokenizer pipeline. Something like:

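import re

from eyecite import get_citations
from eyecite.models import Token, TokenExtractor
from eyecite.tokenizers import EXTRACTORS, AhocorasickTokenizer, HyperscanTokenizer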

# inject your custom regexes
EXTRACTORS.extend(
    [
        TokenExtractor(
            r'YOUR_CUSTOM_REGEX',  # e.g. r'\d{2}-ORD-\d{3}'
            Token.from_match,
            flags=re.I
        )
    ]
)

# re-create the tokenizer with these regexes now included
default_tokenizer = AhocorasickTokenizer()
# or, if hyperscan is installed:
# default_tokenizer = HyperscanTokenizer()

# now just use eyecite like normal
get_citations(text, tokenizer=default_tokenizer)

Not 100% sure this will work, but I would suggest playing around with something like this!
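
Before injecting anything into eyecite, it may help to check the regexes on their own with plain `re`. Here is a minimal sketch that does not touch eyecite at all. The two patterns and the `find_custom_cites` helper are my own guesses at the formats described above (a `KRS <chapter>.<section>` cite and a `<YY>-ORD-<NNN>` cite), not anything taken from eyecite or reporters-db, so adjust them to the citations you actually see:

```python
import re

# Hypothetical patterns (assumptions, not from eyecite or reporters-db).
ORD_REGEX = r"\b\d{2}-ORD-\d{3}\b"                # e.g. 19-ORD-045
KRS_REGEX = r"\bKRS\s+\d{1,3}[A-Z]?\.\d{3,4}\b"   # e.g. KRS 61.878


def find_custom_cites(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in (("ORD", ORD_REGEX), ("KRS", KRS_REGEX)):
        for m in re.finditer(pattern, text, flags=re.I):
            hits.append((m.start(), label, m.group(0)))
    # Sort by start offset so results follow the order of the text.
    return [(label, cite) for _, label, cite in sorted(hits)]


sample = "See KRS 61.878 and 19-ORD-045; compare krs 446.080."
print(find_custom_cites(sample))
```

Once the patterns match what you expect, the same regex strings could be passed to the extractors in the snippet above.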

mattdahl avatar Nov 07 '25 15:11 mattdahl