Citation Lookup API fails: AttributeError: 'NoneType' object has no attribute 'span'
Sentry Issue: COURTLISTENER-9AB
AttributeError: 'NoneType' object has no attribute 'span'
(15 additional frame(s) were not displayed)
...
File "cl/api/utils.py", line 350, in initial
super().initial(request, *args, **kwargs)
File "cl/api/utils.py", line 599, in allow_request
self.throttle_request_by_citation_count(request, view)
File "cl/api/utils.py", line 569, in throttle_request_by_citation_count
self.save_citation_count(request, view)
File "cl/api/utils.py", line 576, in save_citation_count
citation_count = self.get_citation_count_from_request(request, view)
File "cl/api/utils.py", line 521, in get_citation_count_from_request
eyecite.get_citations(text, tokenizer=HYPERSCAN_TOKENIZER)
Six instances so far.
Filed by @mlissner
Eduardo, can you analyze this, please, to see how big a deal it is?
I take that back -
Ah ha. Let's put it on your backlog then, instead of Eduardo's. How big an impact do you think this has?
I did a quick look and it looks eyecite-related. I'm not sure it's related to what I just added yesterday, but since it's eyecite I think we should investigate.
Just to complete the thought, I noticed that the citation lookup yesterday didn't convert the reporter to the corrected reporter, which is necessary to do a proper citation lookup.
This looks like a bug in Eyecite, potentially due to a reporters-db regex pattern or old timey citation. However, the stack trace doesn’t include the citation being processed, making it unclear how to fix. We need a way to capture the failing input for further debugging.
In the meantime, I am going to move this issue to eyecite
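Since the stack trace doesn't include the citation being processed, one way to capture the failing input would be a logging wrapper around the parser call. This is only a sketch; `get_citations_logged` and its `parse` callable are hypothetical names, not existing CourtListener code:

```python
import logging

logger = logging.getLogger(__name__)

def get_citations_logged(text, parse):
    """Run the citation parser; on failure, log enough context
    (length plus a bounded excerpt) to reproduce the failing input
    from Sentry, then re-raise."""
    try:
        return parse(text)
    except AttributeError:
        logger.exception(
            "Citation parsing failed (len=%d, head=%r)", len(text), text[:200]
        )
        raise
```

In CourtListener this could wrap the `eyecite.get_citations(text, tokenizer=HYPERSCAN_TOKENIZER)` call in `get_citation_count_from_request`.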
I've tested a few variants, including nominative reporters with non-standard volumes, but nothing replicates the bug yet.
Thanks Bill. Since it's an eyecite bug, that's your domain, but should we also open a bug for the API not looking things up properly?
From this related Sentry issue I got a reproducible example. It seems to be a Hyperscan error due to a corrupted document. I'll look for more examples, but maybe the user is introducing some strange characters?
from eyecite import get_citations, clean_text
from eyecite.tokenizers import HyperscanTokenizer
import requests
HYPERSCAN_TOKENIZER = HyperscanTokenizer(cache_dir=".hyperscan")
# `token` holds a CourtListener API key (not shown here)
r = requests.get("https://www.courtlistener.com/api/rest/v4/recap-documents/429621284", headers={"Authorization": f"Token {token}"})
document = r.json()
text = document['plain_text']
cleaned_text = clean_text(text, ["all_whitespace"])
# this fails with AttributeError: 'NoneType' object has no attribute 'span'
citations = get_citations(cleaned_text, tokenizer=HYPERSCAN_TOKENIZER)
# these don't fail
citations = get_citations(cleaned_text)
citations = get_citations(cleaned_text[:1128312], tokenizer=HYPERSCAN_TOKENIZER)
# the document's text after the failing index has a bunch of binary-like characters?
# if you fish into the exception using %pdb, you can get the character offset where this is failing
# it's 1128312
In [34]: cleaned_text[1128312:1128312+300]
Out[34]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c \u202bڋ\u202a-*%\x0f\x10\x04\x05%\x08V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0.\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bhǦ\u038b\u202cڋ\u202a- V\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?%\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bşİ\u038b\u202cڋ\u202a-3V\u202c\u202c \u202bڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\x8fĚ\u038b Ȉ\u202cڋ\u202a-\x01V\u202c\u202c \u202bڋ\u202a- \x06\x06\x18 \x013V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ \u202a\x17"0AH\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038bwİ\u038b Ȉ\u202cڋ\u202a-JV\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?'
Interesting. Could be somebody looking for vulnerabilities by sending us weird stuff. I guess if this only happens with wacky input like this it'd be nice to put in a little fix if that's possible, but if it's only with bad input and fixing it is hard, maybe we just ignore it completely.
The real issue is that we are failing ourselves by allowing unprintable characters to get combined into a citation in the first place.
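One possible guard, sketched here with a hypothetical `strip_unprintable` helper (not existing CourtListener or eyecite code), is to drop control and format characters before tokenizing, which would remove runs like the `\x08` and `\u202c` characters shown above while keeping normal whitespace:

```python
import unicodedata

def strip_unprintable(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters that can get
    combined into bogus citations, keeping standard whitespace."""
    return "".join(
        ch for ch in text
        if ch in "\n\r\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```

Note this is lossy by design: it keeps legitimate non-ASCII text (letters, punctuation, spaces like `\xa0`) and only removes the invisible character classes that showed up in the corrupted scans.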
Sentry Issue: COURTLISTENER-739
This one comes from RecapDocuments; Bill took a look and found the weird characters came from the scanned parts
https://www.courtlistener.com/docket/68197600/1/united-states-v-cellular-telephone-assigned-number-414-629-4401/ https://www.courtlistener.com/docket/4328332/10595/39/in-re-terrorist-attacks-on-september-11-2001/
All of them have scanned parts that have been extracted as weird characters.
[
# recap document id, offset
(424646788, 1368218),
(426392057, 12402),
(384413229, 10782)
]
Sentry Issue: COURTLISTENER-8YJ
This one comes from a minimal example; it breaks the HyperscanTokenizer, but not the default one
get_citations("Shady Grove Farms \xa0v Goldsmith Seeds. 1981", tokenizer=HYPERSCAN_TOKENIZER)
This PR should fix it:
https://github.com/freelawproject/eyecite/pull/235
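Until that fix is released, one interim option is to fall back to the slower default tokenizer when the hyperscan path raises, since the reports above show the default tokenizer handles these documents. This is a sketch; `parse_with_fallback` is a hypothetical helper, not something eyecite provides:

```python
def parse_with_fallback(text, primary, fallback):
    """Try the fast (hyperscan) parser first; if it dies with the
    AttributeError seen above, retry with the slower fallback parser."""
    try:
        return primary(text)
    except AttributeError:
        return fallback(text)
```

In CourtListener, `primary` could be `lambda t: get_citations(t, tokenizer=HYPERSCAN_TOKENIZER)` and `fallback` could be plain `get_citations`.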