Annotation create API crashes on long URLs (OperationalError index row size exceeds maximum)
https://sentry.io/organizations/hypothesis/issues/956191704/?project=37293&referrer=github_plugin
OperationalError: (psycopg2.OperationalError) index row size 3216 exceeds maximum 2712 for index "uq__document_uri__claimant_normalized"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
[SQL: 'INSERT INTO document_uri (created, updated, claimant, claimant_normalized, uri, uri_normalized, type, content_type, document_id) VALUES (%(created)s, %(updated)s, %(claimant)s, %(claimant_normalized)s, %(uri)s, %(uri_normalized)s, %(type)s, %(content_type)s, %(document_id)s) RETURNING document_uri.id'] [parameters: {'uri_normalized': u'httpx://pdf.sciencedirectassets.com/278653/1-s2.0-S1877705811X00164/1-s2.0-S1877705811029535/main.pdf?AWSAccessKeyId=ASIAQ3PHCVTYQUUTX7TQ&Expires=15 ... (1292 characters truncated) ... ZZYEEjAgiZEn3iJkF9RJF42+wzdHidDMB0zXFzcOFaRJ0+Em2wfBQkCGLh6j6WeLnBwoo5jFNLoKGLbb0JX1on4tX%2FMw%2FLSnJ2Cu38asyVXlAnviKGKaZXN+At0q04zZmLUQoNyYHEMQVQKQ=', 'updated': datetime.datetime(2019, 4, 13, 20, 5, 24, 263538), 'created'...
(29 additional frame(s) were not displayed)
...
File "sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "sqlalchemy/engine/base.py", line 1182, in _execute_context
context)
File "sqlalchemy/engine/default.py", line 469, in do_execute
cursor.execute(statement, parameters)
File "newrelic/hooks/database_psycopg2.py", line 51, in execute
**kwargs)
File "newrelic/hooks/database_dbapi2.py", line 25, in execute
*args, **kwargs)
As I see it,
sa.UniqueConstraint(
    "claimant_normalized", "uri_normalized", "type", "content_type"
)
is causing the error: claimant_normalized is often huge, and PostgreSQL cannot build the btree index that backs this unique constraint when the combined index row is larger than 2712 bytes.
So the options I see are:
- The simplest, I think, would be if we have an estimate of the maximum size of the claimant_normalized column. In that case we can just add that length to the UnicodeText column and that would fix it.
- If it doesn't have a maximum size and is unbounded, do we really need to keep it unique? If yes, then we need to implement some form of md5 hashing of the column and put a UniqueConstraint (or unique index) on the hash (see the sketch below).
- If we are using it just for indexing purposes, then the hash approach won't work. Also, if we don't really need it to be unique (which might be the case if it is a huge unbounded string), then maybe we can remove it from the UniqueConstraint.
I personally feel Option 1 might be the best if we have a rough estimate of the maximum size of claimant_normalized we want to allow.
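For illustration only, here is a rough sketch of what option 2 (the md5 hash approach the PostgreSQL HINT suggests) might look like as an Alembic-style migration. The constraint name is taken from the error message above; the index name, and whether the existing constraint is really dropped this way, are assumptions that would need checking against the actual schema.

```python
# Sketch only -- assumes an Alembic migration environment and that the
# existing unique constraint is named as in the Sentry error message.
from alembic import op


def upgrade():
    # Drop the over-wide unique constraint on the raw columns...
    op.drop_constraint(
        "uq__document_uri__claimant_normalized", "document_uri", type_="unique"
    )
    # ...and replace it with a unique index over an md5 hash of the long
    # column, so each index row stays well under the 2712-byte limit.
    op.execute(
        "CREATE UNIQUE INDEX uq__document_uri__claimant_md5 "
        "ON document_uri (md5(claimant_normalized), uri_normalized, type, content_type)"
    )


def downgrade():
    op.execute("DROP INDEX uq__document_uri__claimant_md5")
    op.create_unique_constraint(
        "uq__document_uri__claimant_normalized",
        "document_uri",
        ["claimant_normalized", "uri_normalized", "type", "content_type"],
    )
```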
We should avoid setting a max length on the database column as this can cause various painful issues with database migrations and the like. Instead we should enforce the maximum length at the validation level, before the data hits the database, for example in form validation or in API request validation. If necessary we can also add a @sqlalchemy.orm.validates()-decorated method to the model class so that the length is also validated at the model level, but in Python rather than in PostgreSQL. See USERNAME_MAX_LENGTH and others for examples of this.
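A rough sketch of what such a model-level check could look like. The model name, the 2048 limit, and the column set here are placeholders for illustration, not the real h DocumentURI class or an agreed limit:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, validates  # SQLAlchemy 1.4+ import path

Base = declarative_base()

# Placeholder limit -- the real value would need to be decided, in the same
# way USERNAME_MAX_LENGTH is a named constant for usernames.
CLAIMANT_MAX_LENGTH = 2048


class DocumentURISketch(Base):
    """Illustrative model only, not the real h DocumentURI class."""

    __tablename__ = "document_uri_sketch"

    id = sa.Column(sa.Integer, primary_key=True)
    claimant = sa.Column(sa.UnicodeText, nullable=False)
    claimant_normalized = sa.Column(sa.UnicodeText, nullable=False)

    @validates("claimant", "claimant_normalized")
    def _validate_claimant_length(self, key, value):
        # Enforce the limit in Python before the row reaches PostgreSQL,
        # so the column itself stays an unbounded UnicodeText.
        if value is not None and len(value) > CLAIMANT_MAX_LENGTH:
            raise ValueError(f"{key} must be at most {CLAIMANT_MAX_LENGTH} characters")
        return value
```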
I think the unique constraint is necessary because two document URIs with the same claimant, URI, type and content type would be duplicates of each other, so this is used to prevent duplicate data. I think it should be enough to use Python-level validation to prevent long claimant strings from getting to the DB in the first place.
> Instead we should enforce the maximum length at the validation level, before the data hits the database.
This makes sense. @seanh What would be a good max length for validation? I would like to work on this issue in that case.
This caused a PagerDuty alert this morning, as these errors triggered a New Relic alert based on a threshold of 5xx responses to annotation creation calls within a 5 minute time window. See https://hypothes-is.slack.com/archives/C074BUPEG/p1629073392031200.
Per Sean's comments, I think the minimum useful solution here would be to pick some limits on the lengths of URLs that may be used as an annotation's main uri or included in any of the document metadata in the document field of an annotation create request, and to reject, with a 4xx response, API requests that include URIs exceeding those limits.
Where these requests come from the Hypothesis client, that will trigger an error when users try to create an annotation. That will be adequate to begin with. We might be able to do something smarter on the client side in future to drop over-long URLs from the document metadata.
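As a sketch of the kind of request-level check this implies, assuming a JSON-Schema-style validator. The URI_MAX_LENGTH value and the schema shape below are placeholders; the real h annotation schemas are larger and structured differently:

```python
import jsonschema

# Placeholder limit -- the actual value would need to be chosen.
URI_MAX_LENGTH = 2048

# Illustrative fragment showing maxLength constraints on the annotation's
# main uri and on link hrefs inside the document metadata.
ANNOTATION_URI_SCHEMA = {
    "type": "object",
    "properties": {
        "uri": {"type": "string", "maxLength": URI_MAX_LENGTH},
        "document": {
            "type": "object",
            "properties": {
                "link": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "href": {"type": "string", "maxLength": URI_MAX_LENGTH},
                        },
                    },
                },
            },
        },
    },
}


def validate_annotation_payload(payload):
    """Raise jsonschema.ValidationError (to be mapped to a 4xx) if any URI is too long."""
    jsonschema.validate(payload, ANNOTATION_URI_SCHEMA)
```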
I noticed that in one of the reports an over-long URL was mostly caused by an access_token parameter with a very long value. This parameter is not important from a document identity perspective, so ideally it would be dropped by URL normalization. That then raises the question of whether any limits should apply to normalized or un-normalized URLs. Changing URL normalization is a complex task, though.
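Purely to illustrate the idea (this is not a proposal for h's actual normalization code, and the parameter deny-list below is hypothetical), dropping such query parameters could look roughly like this:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list of query parameters that carry no document identity.
IGNORED_QUERY_PARAMS = {"access_token", "AWSAccessKeyId", "Signature", "Expires"}


def strip_ignored_params(url):
    """Return the URL with the ignored query parameters removed."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(query, keep_blank_values=True)
        if name not in IGNORED_QUERY_PARAMS
    ]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))
```

The open question from above still applies: a length limit checked against the un-normalized URL could reject requests whose normalized form would have been fine, and vice versa.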