Annotation create API crashes on long URLs (OperationalError index row size exceeds maximum)
https://sentry.io/organizations/hypothesis/issues/956191704/?project=37293&referrer=github_plugin
OperationalError: (psycopg2.OperationalError) index row size 3216 exceeds maximum 2712 for index "uq__document_uri__claimant_normalized"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
[SQL: 'INSERT INTO document_uri (created, updated, claimant, claimant_normalized, uri, uri_normalized, type, content_type, document_id) VALUES (%(created)s, %(updated)s, %(claimant)s, %(claimant_normalized)s, %(uri)s, %(uri_normalized)s, %(type)s, %(content_type)s, %(document_id)s) RETURNING document_uri.id'] [parameters: {'uri_normalized': u'httpx://pdf.sciencedirectassets.com/278653/1-s2.0-S1877705811X00164/1-s2.0-S1877705811029535/main.pdf?AWSAccessKeyId=ASIAQ3PHCVTYQUUTX7TQ&Expires=15 ... (1292 characters truncated) ... ZZYEEjAgiZEn3iJkF9RJF42+wzdHidDMB0zXFzcOFaRJ0+Em2wfBQkCGLh6j6WeLnBwoo5jFNLoKGLbb0JX1on4tX%2FMw%2FLSnJ2Cu38asyVXlAnviKGKaZXN+At0q04zZmLUQoNyYHEMQVQKQ=', 'updated': datetime.datetime(2019, 4, 13, 20, 5, 24, 263538), 'created'...
(29 additional frame(s) were not displayed)
...
File "sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "sqlalchemy/engine/base.py", line 1182, in _execute_context
context)
File "sqlalchemy/engine/default.py", line 469, in do_execute
cursor.execute(statement, parameters)
File "newrelic/hooks/database_psycopg2.py", line 51, in execute
**kwargs)
File "newrelic/hooks/database_dbapi2.py", line 25, in execute
*args, **kwargs)
As I see it,
sa.UniqueConstraint(
    "claimant_normalized", "uri_normalized", "type", "content_type"
)
is causing the error: claimant_normalized is often huge, and PostgreSQL cannot build the btree index that backs this unique constraint when the combined index row is larger than 2712 bytes.
So the options I see are:
- The simplest, I think, would be if we have an estimate of the maximum size of the claimant_normalized column. In that case we can just add that length to the UnicodeText column and that would fix it.
- If it doesn't have a maximum size and is unbounded, do we really need to keep it unique? If yes, then we need to implement some form of md5 hashing of the column and put a UniqueConstraint (or unique index) on the hash (see the sketch below).
- If we are using it just for indexing purposes, then the hash approach won't work. Also, if we don't really need it to be unique (which might be the case if it is a huge unbounded string), then maybe we can remove it from the UniqueConstraint.
I personally feel Option 1 might be the best if we have a rough estimate of the maximum size of claimant_normalized we want to allow.
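For illustration only, here is a rough sketch of what option 2 (the md5 hash approach the PostgreSQL HINT suggests) might look like as an Alembic-style migration. The constraint name is taken from the error message above; the index name, and whether the existing constraint is really dropped this way, are assumptions that would need checking against the actual schema.

```python
# Sketch only -- assumes an Alembic migration environment and that the
# existing unique constraint is named as in the Sentry error message.
from alembic import op


def upgrade():
    # Drop the over-wide unique constraint on the raw columns...
    op.drop_constraint(
        "uq__document_uri__claimant_normalized", "document_uri", type_="unique"
    )
    # ...and replace it with a unique index over an md5 hash of the long
    # column, so each index row stays well under the 2712-byte limit.
    op.execute(
        "CREATE UNIQUE INDEX uq__document_uri__claimant_md5 "
        "ON document_uri (md5(claimant_normalized), uri_normalized, type, content_type)"
    )


def downgrade():
    op.execute("DROP INDEX uq__document_uri__claimant_md5")
    op.create_unique_constraint(
        "uq__document_uri__claimant_normalized",
        "document_uri",
        ["claimant_normalized", "uri_normalized", "type", "content_type"],
    )
```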
We should avoid setting a max length on the database column as this can cause various painful issues with database migrations and the like. Instead we should enforce the maximum length at the validation level, before the data hits the database, for example in form validation or in API request validation. If necessary we can also add a @sqlalchemy.orm.validates()-decorated method to the model class so that the length is also validated at the model level, but in Python rather than in PostgreSQL. See USERNAME_MAX_LENGTH and others for examples of this.
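A rough sketch of what such a model-level check could look like. The model name, the 2048 limit, and the column set here are placeholders for illustration, not the real h DocumentURI class or an agreed limit:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, validates  # SQLAlchemy 1.4+ import path

Base = declarative_base()

# Placeholder limit -- the real value would need to be decided, in the same
# way USERNAME_MAX_LENGTH is a named constant for usernames.
CLAIMANT_MAX_LENGTH = 2048


class DocumentURISketch(Base):
    """Illustrative model only, not the real h DocumentURI class."""

    __tablename__ = "document_uri_sketch"

    id = sa.Column(sa.Integer, primary_key=True)
    claimant = sa.Column(sa.UnicodeText, nullable=False)
    claimant_normalized = sa.Column(sa.UnicodeText, nullable=False)

    @validates("claimant", "claimant_normalized")
    def _validate_claimant_length(self, key, value):
        # Enforce the limit in Python before the row reaches PostgreSQL,
        # so the column itself stays an unbounded UnicodeText.
        if value is not None and len(value) > CLAIMANT_MAX_LENGTH:
            raise ValueError(f"{key} must be at most {CLAIMANT_MAX_LENGTH} characters")
        return value
```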
I think the unique constraint is necessary because two document URIs with the same claimant, URI, type and content type would be duplicates of each other, so this is used to prevent duplicate data. I think it should be enough to use Python-level validation to prevent long claimant strings from getting to the DB in the first place.
> Instead we should enforce the maximum length at the validation level, before the data hits the database.
This makes sense. @seanh What would be a good max length for validation? I would like to work on this issue in that case.
This caused a PagerDuty alert this morning, as these errors triggered a New Relic alert based on a threshold of 5xx responses to annotation creation calls within a 5 minute time window. See https://hypothes-is.slack.com/archives/C074BUPEG/p1629073392031200.
Per Sean's comments, I think the minimum useful solution here would be to pick some limits on the lengths of URLs that may be used as an annotation's main uri or included in any of the document metadata in the document field of an annotation create request, and to reject, with a 4xx response, API requests that include URIs exceeding those limits.
Where these requests come from the Hypothesis client, that will trigger an error when users try to create an annotation. That will be adequate to begin with. We might be able to do something smarter on the client side in future to drop over-long URLs from the document metadata.
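As a sketch of the kind of request-level check this implies, assuming a JSON-Schema-style validator. The URI_MAX_LENGTH value and the schema shape below are placeholders; the real h annotation schemas are larger and structured differently:

```python
import jsonschema

# Placeholder limit -- the actual value would need to be chosen.
URI_MAX_LENGTH = 2048

# Illustrative fragment showing maxLength constraints on the annotation's
# main uri and on link hrefs inside the document metadata.
ANNOTATION_URI_SCHEMA = {
    "type": "object",
    "properties": {
        "uri": {"type": "string", "maxLength": URI_MAX_LENGTH},
        "document": {
            "type": "object",
            "properties": {
                "link": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "href": {"type": "string", "maxLength": URI_MAX_LENGTH},
                        },
                    },
                },
            },
        },
    },
}


def validate_annotation_payload(payload):
    """Raise jsonschema.ValidationError (to be mapped to a 4xx) if any URI is too long."""
    jsonschema.validate(payload, ANNOTATION_URI_SCHEMA)
```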
I noticed that in one of the reports an over-long URL was mostly caused by an access_token parameter with a very long value. This parameter is not important from a document identity perspective, so ideally it would be dropped by URL normalization. That then raises the question of whether any limits should apply to normalized or un-normalized URLs. Changing URL normalization is a complex task, though.
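Purely to illustrate the idea (this is not a proposal for h's actual normalization code, and the parameter deny-list below is hypothetical), dropping such query parameters could look roughly like this:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list of query parameters that carry no document identity.
IGNORED_QUERY_PARAMS = {"access_token", "AWSAccessKeyId", "Signature", "Expires"}


def strip_ignored_params(url):
    """Return the URL with the ignored query parameters removed."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(query, keep_blank_values=True)
        if name not in IGNORED_QUERY_PARAMS
    ]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))
```

The open question from above still applies: a length limit checked against the un-normalized URL could reject requests whose normalized form would have been fine, and vice versa.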