Custom or Context-Sensitive Reporters
In my home state, the Commonwealth of Kentucky, appellate (and trial) cases do not cite either Michie's or Baldwin's in the form of, say, Ky. Rev. Stat. Ann. § 021.48. Almost always, they cite the statutes as KRS, without the section sign, and perhaps with a footnote explaining that it is short for Kentucky Revised Statutes. Short of forking reporters-db, is there a quick way to tell eyecite to parse such citations? I have several other, for lack of a better term, "custom" or "context-sensitive" reporters, such as Kentucky Open Records Decisions, where the syntax is typically <YY>-ORD-<NNN>.
This is not documented behavior, but I think you should be able to inject your custom regexes into the tokenizer pipeline. Something like:
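import re

from eyecite import get_citations
from eyecite.models import Token, TokenExtractor
from eyecite.tokenizers import EXTRACTORS, AhocorasickTokenizer

# inject your custom regexes
EXTRACTORS.extend(
    [
        TokenExtractor(
            r'YOUR_CUSTOM_REGEX',  # e.g. r'\d{2}-ORD-\d{3}'
            Token.from_match,
            flags=re.I,
        )
    ]
)

# re-create the tokenizer with these regexes now included
default_tokenizer = AhocorasickTokenizer()
# or, if you have hyperscan installed:
# from eyecite.tokenizers import HyperscanTokenizer
# default_tokenizer = HyperscanTokenizer()

# now just use eyecite like normal (text is the string you want to parse)
get_citations(text, tokenizer=default_tokenizer)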
Not 100% sure this will work, but I would suggest playing around with something like this!
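Before wiring anything into eyecite, it may help to test the regexes on their own with plain `re`. The sketch below is a guess at the two patterns based only on the examples in the question: `KRS_REGEX`, `ORD_REGEX`, `find_custom_cites`, and the named groups are all hypothetical names, and the assumed KRS shape (chapter, dot, section, optional parenthesized subsections) may need adjusting for real opinions.

```python
import re

# Assumed citation shapes, inferred from the question only:
#   "KRS 61.880(2)(a)" -> statute cite without a section sign
#   "17-ORD-123"       -> Open Records Decision, <YY>-ORD-<NNN>
KRS_REGEX = (
    r"\bKRS\s+(?P<chapter>\d+[A-Z]?)\.(?P<section>\d+)"
    r"(?P<subsections>(?:\(\w+\))*)"
)
ORD_REGEX = r"\b(?P<year>\d{2})-ORD-(?P<number>\d{3})\b"


def find_custom_cites(text):
    """Return (kind, matched_text, groups) tuples in order of appearance."""
    found = []
    for kind, pattern in (("KRS", KRS_REGEX), ("ORD", ORD_REGEX)):
        for m in re.finditer(pattern, text):
            found.append((m.start(), kind, m.group(0), m.groupdict()))
    # Sort by position so results follow the order of the source text.
    found.sort(key=lambda item: item[0])
    return [(kind, cite, groups) for _, kind, cite, groups in found]


sample = "See KRS 61.880(2)(a) and 17-ORD-123."
for kind, cite, groups in find_custom_cites(sample):
    print(kind, cite, groups)
```

Once the patterns match what you expect, the same strings are what you would try in place of YOUR_CUSTOM_REGEX.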