
Oversized annotation payloads cause Elasticsearch to explode

Open nickstenning opened this issue 9 years ago • 3 comments

Absurdly large annotation payloads (most of the examples below look like errors in scibot) cause 500 errors in Elasticsearch.

One way to address some issues here is to impose a reasonable maximum size on the overall annotation payload and reject requests that exceed that size with a 400 error (or a 413 error if we're being fancy).

Another option is to impose maximum sizes on specific individual fields such as URL or selector data. For example, a URL over 2K characters is unlikely to work in most browsers, so rejecting URLs over 5K would seem reasonable for us.
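As a rough illustration of the two options above, here is a minimal sketch of a pre-indexing check. Everything in it is hypothetical: the function and exception names, the 1 MiB payload cap, and the assumption that the URL lives in a top-level `uri` field. Only the 5K URL limit and the 400/413 status codes come from the suggestions above.

```python
import json

# Hypothetical limits, chosen for illustration only.
MAX_PAYLOAD_BYTES = 1024 * 1024  # overall request body cap (assumed value)
MAX_URL_CHARS = 5000             # the "5K" URL limit suggested above


class PayloadRejected(Exception):
    """Carries the HTTP status the API would respond with."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def check_annotation_payload(raw_body: bytes) -> dict:
    """Reject oversized payloads before they get anywhere near the index."""
    # Check the raw size first, so huge bodies are never parsed.
    if len(raw_body) > MAX_PAYLOAD_BYTES:
        raise PayloadRejected(413, "annotation payload too large")

    data = json.loads(raw_body)
    if not isinstance(data, dict):
        raise PayloadRejected(400, "annotation must be a JSON object")

    # Assumes the annotation URL is stored under a top-level "uri" key.
    uri = data.get("uri", "")
    if isinstance(uri, str) and len(uri) > MAX_URL_CHARS:
        raise PayloadRejected(400, "uri exceeds maximum length")

    return data
```

Checking the raw byte length before parsing keeps the cheap test first; per-field limits for selector data could be added in the same place.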

https://app.getsentry.com/hypothesis/prod/issues/106485449/

TransportError: TransportError(500, u'IllegalArgumentException[Document contains at least one immense term in field="target.selector.exact" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: \'[10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 10, 32, 32, 32, 32, 32, 32, 32, 32]...\', original message: bytes can be at most 32766 in length; got 51481]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 51481]; ')
(16 additional frame(s) were not displayed)
...
  File "elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "elasticsearch/client/__init__.py", line 263, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "elasticsearch/connection/http_urllib3.py", line 93, in perform_request
    self._raise_error(response.status, raw_data)
  File "elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
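
The limit in the error above is on the UTF-8 encoded length of the term (32766 bytes), not on the character count. A minimal sketch of checking for this before indexing follows. The helper names are hypothetical, and it assumes `target` is a list of objects that each hold a list of selectors, which may not match the actual document shape.

```python
# Limit quoted in the Elasticsearch error message above.
MAX_TERM_BYTES = 32766


def utf8_len(text: str) -> int:
    """Length of the string once encoded as UTF-8, which is what the limit applies to."""
    return len(text.encode("utf-8"))


def oversized_exact_selectors(annotation: dict) -> list:
    """Return the UTF-8 byte lengths of selector "exact" values over the limit.

    Assumes annotation["target"] is a list of dicts, each with a "selector" list.
    """
    sizes = []
    for target in annotation.get("target", []):
        for selector in target.get("selector", []):
            exact = selector.get("exact")
            if isinstance(exact, str) and utf8_len(exact) > MAX_TERM_BYTES:
                sizes.append(utf8_len(exact))
    return sizes
```

An annotation for which this returns a non-empty list could be rejected with a 400 rather than passed on to Elasticsearch.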

nickstenning avatar Feb 10 '16 09:02 nickstenning

This has been happening a lot less since we stopped directly returning the annotations from Elasticsearch, and since we stopped indexing the document fields. In some of the recent occurrences the offending field was target.selector.exact, which needs to be in the search index for the text quote.

chdorner avatar Aug 05 '16 15:08 chdorner

Sentry issue: H-1FY

robertknight avatar Jul 27 '18 06:07 robertknight

This is still a problem but isn't causing that much pain because only a very small number of annotations are affected. I'm removing this from the tasks in the "Upgrade Elasticsearch" project.

robertknight avatar Sep 07 '18 10:09 robertknight