Oversized annotation payloads cause Elasticsearch to explode
Absurdly large annotation payloads (most of the examples below look like errors in scibot) cause 500 errors from Elasticsearch.
One way to address some issues here is to impose a reasonable maximum size on the overall annotation payload and reject requests that exceed that size with a 400 error (or a 413 error if we're being fancy).
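A minimal sketch of that option, assuming a Pyramid-style view helper (h is a Pyramid app); the cap value and the function name are hypothetical, not decided:

```python
# Hypothetical sketch: cap the raw annotation payload size at the API layer.
# The 64 KiB limit is illustrative, not a settled value.
from pyramid.httpexceptions import HTTPRequestEntityTooLarge

MAX_PAYLOAD_BYTES = 64 * 1024

def check_payload_size(request):
    """Raise 413 if the raw request body exceeds the cap."""
    if request.content_length and request.content_length > MAX_PAYLOAD_BYTES:
        # Swap in HTTPBadRequest here for a plain 400 instead of 413.
        raise HTTPRequestEntityTooLarge(
            "annotation payload exceeds %d bytes" % MAX_PAYLOAD_BYTES)
```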
Another option is to impose maximum sizes on specific individual fields such as URL or selector data. For example, a URL over 2K characters is unlikely to work in most browsers, so rejecting URLs over 5K would seem reasonable for us.
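And a sketch of the per-field option; the limits and the payload shape here are assumptions based on the annotation format, not settled values:

```python
# Hypothetical per-field validation; limits are illustrative.
MAX_URI_CHARS = 5000     # URLs over ~2K rarely work in browsers anyway
MAX_EXACT_CHARS = 10000  # cap for TextQuoteSelector "exact" text

def check_field_sizes(annotation):
    """Return a list of error strings for any oversized fields."""
    errors = []
    if len(annotation.get("uri", "")) > MAX_URI_CHARS:
        errors.append("uri exceeds %d characters" % MAX_URI_CHARS)
    for target in annotation.get("target", []):
        for selector in target.get("selector", []):
            if len(selector.get("exact", "")) > MAX_EXACT_CHARS:
                errors.append(
                    "selector exact text exceeds %d characters"
                    % MAX_EXACT_CHARS)
    return errors
```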
https://app.getsentry.com/hypothesis/prod/issues/106485449/
TransportError: TransportError(500, u'IllegalArgumentException[Document contains at least one immense term in field="target.selector.exact" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: \'[10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 10, 32, 32, 32, 32, 32, 32, 32, 32]...\', original message: bytes can be at most 32766 in length; got 51481]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 51481]; ')
(16 additional frame(s) were not displayed)
...
File "elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "elasticsearch/client/__init__.py", line 263, in index
_make_path(index, doc_type, id), params=params, body=body)
File "elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
This has been happening a lot less since we stopped directly returning the annotations from Elasticsearch, and since we stopped indexing the document fields.
In some of the recent occurrences the offending field was target.selector.exact, which needs to stay in the search index to support text quote search.
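Since that field has to stay in the index, one index-side mitigation worth noting (a sketch, not a decision) is Elasticsearch's ignore_above, which makes ES skip indexing values longer than a cap instead of failing the whole document. It only applies to not_analyzed string fields, and changing the mapping on a live index would probably mean reindexing; the index name and cap below are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical mapping tweak: values longer than the cap are skipped at index
# time instead of raising MaxBytesLengthExceededException for the whole doc.
exact_mapping = {
    "type": "string",
    "index": "not_analyzed",
    "ignore_above": 10000,  # illustrative cap, in characters
}

es.indices.put_mapping(
    index="annotator",  # illustrative index name
    doc_type="annotation",
    body={
        "annotation": {
            "properties": {
                "target": {
                    "properties": {
                        "selector": {
                            "properties": {"exact": exact_mapping},
                        },
                    },
                },
            },
        },
    },
)
```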
Sentry issue: H-1FY
This is still a problem, but it isn't causing much pain because only a very small number of annotations are affected. I'm removing this from the tasks in the "Upgrade Elasticsearch" project.