Collect_events Error: deadlock detected
We are seeing over 100 of these errors:
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1938, in _exec_single_context
self.dialect.do_executemany(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 794, in do_executemany
cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 1969098 waits for ShareLock on transaction 145676170; blocked by process 1969078.
Process 1969078 waits for ShareLock on transaction 145676175; blocked by process 1969098.
HINT: See server log for query details.
CONTEXT: while locking tuple (2423229,6) in relation "contributors"
SQL statement "SELECT 1 FROM ONLY "augur_data"."contributors" x WHERE "cntrb_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/venv/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
return self.run(*args, **kwargs)
File "/augur/augur/tasks/github/events.py", line 46, in collect_events
collection_strategy.collect(repo_git, key_auth, core_data_last_collected)
File "/augur/augur/tasks/github/events.py", line 124, in collect
self._process_events(events, repo_id)
File "/augur/augur/tasks/github/events.py", line 167, in _process_events
update_issue_closed_cntrbs_by_repo_id(repo_id)
File "/augur/augur/application/db/lib.py", line 567, in update_issue_closed_cntrbs_by_repo_id
connection.execute(update_stmt, update_data)
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
return meth(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 516, in _execute_on_connection
return connection._execute_clauseelement(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
ret = self._execute_context(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
return self._exec_single_context(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
self._handle_dbapi_exception(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2343, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1938, in _exec_single_context
self.dialect.do_executemany(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 794, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL: Process 1969098 waits for ShareLock on transaction 145676170; blocked by process 1969078.
Process 1969078 waits for ShareLock on transaction 145676175; blocked by process 1969098.
HINT: See server log for query details.
CONTEXT: while locking tuple (2423229,6) in relation "contributors"
SQL statement "SELECT 1 FROM ONLY "augur_data"."contributors" x WHERE "cntrb_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
[SQL:
UPDATE issues
SET cntrb_id = %(cntrb_id)s
WHERE issue_id = %(issue_id)s
AND repo_id = %(repo_id)s
]
[parameters: [{'cntrb_id': UUID('0105275d-6400-0000-0000-000000000000'), 'issue_id': 24652500, 'repo_id': 64768}, {'cntrb_id': UUID('0102bfea-9500-0000-0000-000000000000'), 'issue_id': 24652501, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652502, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652503, 'repo_id': 64768}, {'cntrb_id': UUID('0101e0c8-0a00-0000-0000-000000000000'), 'issue_id': 24652505, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652507, 'repo_id': 64768}, {'cntrb_id': UUID('0105275d-6400-0000-0000-000000000000'), 'issue_id': 24652508, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652509, 'repo_id': 64768} ... displaying 10 of 685 total bound parameter sets ... {'cntrb_id': UUID('01000488-dd00-0000-0000-000000000000'), 'issue_id': 228519090, 'repo_id': 64768}, {'cntrb_id': UUID('01000488-dd00-0000-0000-000000000000'), 'issue_id': 230637437, 'repo_id': 64768}]]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
This is an issue with large Augur instances that are collecting data on more than 20 repositories simultaneously. We do not think changes to the general Augur architecture are necessary.
Perhaps some instructions on hash partitioning the contributors table, and possibly placing it on a different physical disk via a tablespace, would be a useful approach for very large Augur instances. This strategy is working on our largest test instance (which we believe is also the largest Augur instance overall).
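A rough sketch of what that could look like is below (untested; the schema and key column names come from the error above, while the partition count, tablespace name, connection string, and migration steps are assumptions for illustration):

```python
# Hash-partitioning sketch for the contributors table on very large instances.
# augur_data.contributors and cntrb_id come from the error above; everything
# else (8 partitions, the "contributors_disk" tablespace, the DSN) is assumed.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

with engine.begin() as conn:
    # Postgres cannot partition an existing table in place, so build a
    # partitioned copy, backfill it, and swap it in during a maintenance window.
    conn.execute(text("""
        CREATE TABLE augur_data.contributors_partitioned
            (LIKE augur_data.contributors INCLUDING DEFAULTS)
            PARTITION BY HASH (cntrb_id)
    """))
    for i in range(8):
        # Each partition can live in a tablespace backed by a separate physical disk.
        conn.execute(text(
            f"CREATE TABLE augur_data.contributors_p{i} "
            f"PARTITION OF augur_data.contributors_partitioned "
            f"FOR VALUES WITH (MODULUS 8, REMAINDER {i}) "
            f"TABLESPACE contributors_disk"
        ))
    conn.execute(text(
        "INSERT INTO augur_data.contributors_partitioned "
        "SELECT * FROM augur_data.contributors"
    ))
    # Recreating the primary key and indexes (which must include cntrb_id),
    # renaming the tables, and re-pointing foreign keys are omitted here.
```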
@cdolfi: Let me know if you want that hash partitioning strategy documented to close this out, or what other resolution you have in mind.
Deadlocks are a logical-level conflict. Hash partitioning is a strategy to optimize physical storage, as is the use of tablespaces. As a solution, it's unrelated to the deadlock problem. It reduces the symptoms as a side effect -- if updates finish faster, there are fewer conflicts -- but at the cost of a lot of complexity.
The real solution is to look at how entities are being locked and updated, and take steps to make the workers conflict less.
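To make that concrete, one common step (illustrative only, not current Augur code) is to have every worker acquire row locks in the same deterministic order and in smaller batches, so two concurrent transactions can't each end up holding a lock the other one needs next. The dict shape mirrors the bound parameters in the traceback above:

```python
# Illustrative helper: sort the pending issue updates deterministically and
# yield small batches, so concurrent workers lock contributors rows (via the
# FK check) in a consistent order and hold fewer locks at a time.
def ordered_update_batches(update_data, batch_size=100):
    ordered = sorted(update_data, key=lambda row: (row["cntrb_id"], row["issue_id"]))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```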
It was found that the events collection can lock an uncapped number of rows during updates. The operation also locks the contributors table via a secondary relation, so it seems prone to this kind of issue.
We have already refactored other workers that interface with the contributors table so that they batch their updates, but this task has not yet been updated. At Monday's meeting, we discussed refactoring the events collection to batch its updates as well and to add retries with deadlock detection and rollbacks.
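The shape we discussed looks roughly like this (a hedged sketch, not the actual refactor; the engine/DSN, batch size, and retry policy are assumptions, and the statement text is taken from the error above):

```python
# Sketch of batched issue updates with deadlock detection, rollback, and retry.
import time
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

update_stmt = text("""
    UPDATE issues
    SET cntrb_id = :cntrb_id
    WHERE issue_id = :issue_id
      AND repo_id = :repo_id
""")

def update_in_batches(update_data, batch_size=100, max_retries=3):
    for start in range(0, len(update_data), batch_size):
        batch = update_data[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                # engine.begin() commits on success and rolls back on error, so a
                # deadlocked batch is fully rolled back before we retry it.
                with engine.begin() as connection:
                    connection.execute(update_stmt, batch)
                break
            except OperationalError as exc:
                if "deadlock detected" in str(exc.orig) and attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # simple backoff before retrying
                    continue
                raise
```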
Are the locks on the contributor table due to actual updates to the table, or strictly because of the foreign key?
I am not super familiar with this set of tasks, so I only know a little bit about that operation. To the best of my knowledge, it's just the foreign key that is involved in the update.
If it's just the FK ... have we considered removing the FKs? It's not uncommon to do so in data warehousing applications, partly for this reason (and partly for performance). The tradeoff is that we'd need to be very sure that the workers wouldn't add orphan rows. But it would cure this deadlocking issue, and probably others that are less prominent.
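For concreteness, dropping the FK would look roughly like the sketch below (untested; the constraint is looked up by name because I don't know what it is called in the live schema, and the DSN is a placeholder):

```python
# Illustration of the "remove the FK" idea: find and drop the foreign key(s)
# from issues.cntrb_id to contributors, which is what takes the FOR KEY SHARE
# locks seen in the error above. The workers would then be responsible for
# never inserting cntrb_id values that have no matching contributors row.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

with engine.begin() as conn:
    rows = conn.execute(text("""
        SELECT conname
        FROM pg_constraint
        WHERE contype = 'f'
          AND conrelid = 'augur_data.issues'::regclass
          AND confrelid = 'augur_data.contributors'::regclass
    """)).fetchall()
    for (conname,) in rows:
        conn.execute(text(f'ALTER TABLE augur_data.issues DROP CONSTRAINT "{conname}"'))
```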
I haven't heard that suggestion come up yet. That may be a question for @sgoggins or @ABrain7710