Citation Lookup API fails: AttributeError: 'NoneType' object has no attribute 'span'

Open sentry[bot] opened this issue 1 year ago • 13 comments

Sentry Issue: COURTLISTENER-9AB

AttributeError: 'NoneType' object has no attribute 'span'
(15 additional frame(s) were not displayed)
...
  File "cl/api/utils.py", line 350, in initial
    super().initial(request, *args, **kwargs)
  File "cl/api/utils.py", line 599, in allow_request
    self.throttle_request_by_citation_count(request, view)
  File "cl/api/utils.py", line 569, in throttle_request_by_citation_count
    self.save_citation_count(request, view)
  File "cl/api/utils.py", line 576, in save_citation_count
    citation_count = self.get_citation_count_from_request(request, view)
  File "cl/api/utils.py", line 521, in get_citation_count_from_request
    eyecite.get_citations(text, tokenizer=HYPERSCAN_TOKENIZER)

Six instances so far.

Filed by @mlissner

sentry[bot] avatar Feb 27 '25 01:02 sentry[bot]

Eduardo, can you analyze this, please, to see how big a deal it is?

mlissner avatar Feb 27 '25 01:02 mlissner

I take that back -

flooie avatar Feb 27 '25 18:02 flooie

Ah ha. Let's put it on your backlog then, instead of Eduardo's. How big an impact do you think this has?

mlissner avatar Feb 27 '25 18:02 mlissner

I took a quick look and it looks eyecite-related. I'm not sure it's related to what I just added yesterday, but since it's eyecite, I think we should investigate.

flooie avatar Feb 27 '25 18:02 flooie

Just to complete the thought, I noticed that the citation lookup yesterday didn't convert the reporter to the corrected reporter, which is necessary to do a proper citation lookup.

flooie avatar Feb 27 '25 18:02 flooie

This looks like a bug in Eyecite, potentially due to a reporters-db regex pattern or old timey citation. However, the stack trace doesn’t include the citation being processed, making it unclear how to fix. We need a way to capture the failing input for further debugging.

In the meantime, I am going to move this issue to eyecite

I've tested a few variants, including volume nominatives and non-standard volumes, but nothing has replicated the bug yet.
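One way to capture the failing input would be a small wrapper around the extraction call. This is only a sketch: `call_and_capture` is a hypothetical helper name, not something that exists in eyecite or CourtListener, and how much of the text is safe to log is an assumption to revisit.

```python
import hashlib
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_and_capture(extract: Callable[[str], T], text: str) -> T:
    """Call extract(text); on failure, log enough to find the input, then re-raise.

    Hypothetical helper: logs the length, a SHA-256 digest, and the first
    200 characters (repr-escaped so unprintable characters stay visible).
    """
    try:
        return extract(text)
    except Exception:
        digest = hashlib.sha256(text.encode("utf-8", "replace")).hexdigest()
        logger.exception(
            "Citation extraction failed: len=%d sha256=%s head=%r",
            len(text),
            digest,
            text[:200],
        )
        raise
```

In the throttle code from the stack trace, `extract` could be something like `lambda t: eyecite.get_citations(t, tokenizer=HYPERSCAN_TOKENIZER)`. The exception still propagates, so behavior is unchanged apart from the extra log line.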

flooie avatar Feb 27 '25 21:02 flooie

Thanks Bill. It sounds like since it's an eyecite bug, that's your domain, but should we also open a bug for the API not looking things up properly?

mlissner avatar Feb 28 '25 00:02 mlissner

From this related Sentry issue I got a reproducible example. Seems to be a Hyperscan error due to a corrupted document. Will look for more examples; but maybe the user is introducing some strange characters?

import requests
from eyecite import clean_text, get_citations
from eyecite.tokenizers import HyperscanTokenizer

HYPERSCAN_TOKENIZER = HyperscanTokenizer(cache_dir=".hyperscan")
token = "..."  # placeholder: your CourtListener API token

r = requests.get(
    "https://www.courtlistener.com/api/rest/v4/recap-documents/429621284",
    headers={"Authorization": f"Token {token}"},
)
document = r.json()
text = document["plain_text"]
cleaned_text = clean_text(text, ["all_whitespace"])

# this fails with AttributeError: 'NoneType' object has no attribute 'span'
citations = get_citations(cleaned_text, tokenizer=HYPERSCAN_TOKENIZER)

# these don't fail
citations = get_citations(cleaned_text)
citations = get_citations(cleaned_text[:1128312], tokenizer=HYPERSCAN_TOKENIZER)

# the document's text after the failing index has a bunch of binary like characters?
# if you fish into the exception using %pdb, you can get the offset character where this is failing
# it's 1128312
In [34]: cleaned_text[1128312:1128312+300]
Out[34]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c \u202bڋ\u202a-*%\x0f\x10\x04\x05%\x08V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0.\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bhǦ\u038b\u202cڋ\u202a- V\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?%\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bşİ\u038b\u202cڋ\u202a-3V\u202c\u202c \u202bڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\x8fĚ\u038b Ȉ\u202cڋ\u202a-\x01V\u202c\u202c \u202bڋ\u202a- \x06\x06\x18 \x013V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ \u202a\x17"0AH\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038bwİ\u038b Ȉ\u202cڋ\u202a-JV\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?'
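
The offset above was found with %pdb. Where a debugger isn't handy, bisecting on prefix length is another way to narrow it down. A minimal sketch, assuming that once a prefix fails every longer prefix fails too (this may not hold exactly if cutting the text changes how it is tokenized); `first_failing_prefix` is a hypothetical helper name:

```python
from typing import Callable, Optional


def first_failing_prefix(text: str, extract: Callable[[str], object]) -> Optional[int]:
    """Return the smallest n such that extract(text[:n]) raises, or None.

    Assumes failures are monotone in prefix length.
    """

    def fails(n: int) -> bool:
        try:
            extract(text[:n])
        except Exception:
            return True
        return False

    if not fails(len(text)):
        return None  # the full text passes, nothing to find
    if fails(0):
        return 0  # even the empty string fails

    lo, hi = 0, len(text)  # invariant: prefix lo passes, prefix hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

With the reproduction above, `extract` would be `lambda t: get_citations(t, tokenizer=HYPERSCAN_TOKENIZER)`. Each probe re-runs extraction on a prefix, so on a document of over a million characters this takes about twenty runs.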

grossir avatar Feb 28 '25 16:02 grossir

Interesting. Could be somebody looking for vulnerabilities by sending us weird stuff. I guess if this only happens with wacky code like this, it'd be nice to put in a little fix if that's possible, but if it's only with bad input and fixing it is hard, maybe we just ignore it completely.

mlissner avatar Feb 28 '25 16:02 mlissner

the real issue is that we are failing ourselves by allowing unprintable characters to get combined into a citation in the first place.
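
A minimal sketch of what filtering unprintable characters before citation extraction could look like. This is an illustration only, not what eyecite does: `strip_unprintable` is a hypothetical name, and the choice of Unicode categories (Cc control, Cf format) is an assumption.

```python
import unicodedata

# Whitespace control characters that ordinary text legitimately contains.
_KEEP = {"\n", "\t", "\r"}


def strip_unprintable(text: str) -> str:
    """Replace control (Cc) and format (Cf) characters with a space.

    Replacing rather than deleting keeps the string length unchanged,
    so character offsets into the original text stay valid.
    """
    return "".join(
        " " if ch not in _KEEP and unicodedata.category(ch) in {"Cc", "Cf"} else ch
        for ch in text
    )
```

For example, `strip_unprintable("a\x08b\u202cc\n")` gives `"a b c\n"`. Note that `\xa0` (no-break space) is in category Zs, so this sketch leaves it alone.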

flooie avatar Feb 28 '25 18:02 flooie

Sentry Issue: COURTLISTENER-739

This one comes from RecapDocuments; Bill took a look and found that the weird characters came from the scanned parts.

https://www.courtlistener.com/docket/68197600/1/united-states-v-cellular-telephone-assigned-number-414-629-4401/
https://www.courtlistener.com/docket/4328332/10595/39/in-re-terrorist-attacks-on-september-11-2001/

All of them have scanned parts that have been extracted as weird characters.

[
# recap document id, offset
(424646788, 1368218),
(426392057, 12402),
(384413229, 10782)
]

sentry[bot] avatar Feb 28 '25 18:02 sentry[bot]

Sentry Issue: COURTLISTENER-8YJ

This one comes from a minimal example; it breaks the HyperscanTokenizer, but not the default one

get_citations("Shady Grove Farms \xa0v Goldsmith Seeds. 1981", tokenizer=HYPERSCAN_TOKENIZER)

sentry[bot] avatar Feb 28 '25 18:02 sentry[bot]

This PR should fix this

https://github.com/freelawproject/eyecite/pull/235

flooie avatar Mar 04 '25 18:03 flooie