Collect_events Error: deadlock detected
We are seeing over 100 of these errors:
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1938, in _exec_single_context
self.dialect.do_executemany(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 794, in do_executemany
cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 1969098 waits for ShareLock on transaction 145676170; blocked by process 1969078.
Process 1969078 waits for ShareLock on transaction 145676175; blocked by process 1969098.
HINT: See server log for query details.
CONTEXT: while locking tuple (2423229,6) in relation "contributors"
SQL statement "SELECT 1 FROM ONLY "augur_data"."contributors" x WHERE "cntrb_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/venv/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
return self.run(*args, **kwargs)
File "/augur/augur/tasks/github/events.py", line 46, in collect_events
collection_strategy.collect(repo_git, key_auth, core_data_last_collected)
File "/augur/augur/tasks/github/events.py", line 124, in collect
self._process_events(events, repo_id)
File "/augur/augur/tasks/github/events.py", line 167, in _process_events
update_issue_closed_cntrbs_by_repo_id(repo_id)
File "/augur/augur/application/db/lib.py", line 567, in update_issue_closed_cntrbs_by_repo_id
connection.execute(update_stmt, update_data)
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
return meth(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 516, in _execute_on_connection
return connection._execute_clauseelement(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
ret = self._execute_context(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
return self._exec_single_context(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
self._handle_dbapi_exception(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2343, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1938, in _exec_single_context
self.dialect.do_executemany(
File "/opt/venv/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 794, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL: Process 1969098 waits for ShareLock on transaction 145676170; blocked by process 1969078.
Process 1969078 waits for ShareLock on transaction 145676175; blocked by process 1969098.
HINT: See server log for query details.
CONTEXT: while locking tuple (2423229,6) in relation "contributors"
SQL statement "SELECT 1 FROM ONLY "augur_data"."contributors" x WHERE "cntrb_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
[SQL:
UPDATE issues
SET cntrb_id = %(cntrb_id)s
WHERE issue_id = %(issue_id)s
AND repo_id = %(repo_id)s
]
[parameters: [{'cntrb_id': UUID('0105275d-6400-0000-0000-000000000000'), 'issue_id': 24652500, 'repo_id': 64768}, {'cntrb_id': UUID('0102bfea-9500-0000-0000-000000000000'), 'issue_id': 24652501, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652502, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652503, 'repo_id': 64768}, {'cntrb_id': UUID('0101e0c8-0a00-0000-0000-000000000000'), 'issue_id': 24652505, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652507, 'repo_id': 64768}, {'cntrb_id': UUID('0105275d-6400-0000-0000-000000000000'), 'issue_id': 24652508, 'repo_id': 64768}, {'cntrb_id': UUID('01017e42-7700-0000-0000-000000000000'), 'issue_id': 24652509, 'repo_id': 64768} ... displaying 10 of 685 total bound parameter sets ... {'cntrb_id': UUID('01000488-dd00-0000-0000-000000000000'), 'issue_id': 228519090, 'repo_id': 64768}, {'cntrb_id': UUID('01000488-dd00-0000-0000-000000000000'), 'issue_id': 230637437, 'repo_id': 64768}]]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
This is an issue with large Augur instances that are collecting data on more than 20 repositories simultaneously. We do not think changes to the general Augur architecture are necessary.
Perhaps some instructions on hash partitioning the contributors table, and possibly placing it on a different physical disk via a tablespace, would be a useful approach for very large Augur instances. This strategy is working on our largest test instance (which we believe is also the largest Augur instance overall).
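A rough sketch of what that could look like is below (untested; the schema and key column names come from the error above, while the partition count, tablespace name, connection string, and migration steps are assumptions for illustration):

```python
# Hash-partitioning sketch for the contributors table on very large instances.
# augur_data.contributors and cntrb_id come from the error above; everything
# else (8 partitions, the "contributors_disk" tablespace, the DSN) is assumed.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

with engine.begin() as conn:
    # Postgres cannot partition an existing table in place, so build a
    # partitioned copy, backfill it, and swap it in during a maintenance window.
    conn.execute(text("""
        CREATE TABLE augur_data.contributors_partitioned
            (LIKE augur_data.contributors INCLUDING DEFAULTS)
            PARTITION BY HASH (cntrb_id)
    """))
    for i in range(8):
        # Each partition can live in a tablespace backed by a separate physical disk.
        conn.execute(text(
            f"CREATE TABLE augur_data.contributors_p{i} "
            f"PARTITION OF augur_data.contributors_partitioned "
            f"FOR VALUES WITH (MODULUS 8, REMAINDER {i}) "
            f"TABLESPACE contributors_disk"
        ))
    conn.execute(text(
        "INSERT INTO augur_data.contributors_partitioned "
        "SELECT * FROM augur_data.contributors"
    ))
    # Recreating the primary key and indexes (which must include cntrb_id),
    # renaming the tables, and re-pointing foreign keys are omitted here.
```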
@cdolfi: Let me know if you want that hash partitioning strategy documented to close this out, or what other resolution you have in mind.
Deadlocks are a logical-level conflict. Hash partitioning is a strategy to optimize physical storage, as is the use of tablespaces. As a solution, it's unrelated to the deadlock problem. It reduces the symptoms as a side effect -- if updates finish faster, there are fewer conflicts -- but at the cost of a lot of complexity.
The real solution is to look at how entities are being locked and updated, and take steps to make the workers conflict less.
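To make that concrete, one common step (illustrative only, not current Augur code) is to have every worker acquire row locks in the same deterministic order and in smaller batches, so two concurrent transactions can't each end up holding a lock the other one needs next. The dict shape mirrors the bound parameters in the traceback above:

```python
# Illustrative helper: sort the pending issue updates deterministically and
# yield small batches, so concurrent workers lock contributors rows (via the
# FK check) in a consistent order and hold fewer locks at a time.
def ordered_update_batches(update_data, batch_size=100):
    ordered = sorted(update_data, key=lambda row: (row["cntrb_id"], row["issue_id"]))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```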
It was found that the events collection can lock an uncapped number of rows during updates. The operation also locks the contributors table via a secondary relation, so it seems prone to this kind of issue.
We have already refactored other workers that interface with the contributors table so that they batch their updates, but this task has not yet been updated. At Monday's meeting, we discussed refactoring the events collection to batch its updates as well and to add retries with deadlock detection and rollbacks.
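The shape we discussed looks roughly like this (a hedged sketch, not the actual refactor; the engine/DSN, batch size, and retry policy are assumptions, and the statement text is taken from the error above):

```python
# Sketch of batched issue updates with deadlock detection, rollback, and retry.
import time
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

update_stmt = text("""
    UPDATE issues
    SET cntrb_id = :cntrb_id
    WHERE issue_id = :issue_id
      AND repo_id = :repo_id
""")

def update_in_batches(update_data, batch_size=100, max_retries=3):
    for start in range(0, len(update_data), batch_size):
        batch = update_data[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                # engine.begin() commits on success and rolls back on error, so a
                # deadlocked batch is fully rolled back before we retry it.
                with engine.begin() as connection:
                    connection.execute(update_stmt, batch)
                break
            except OperationalError as exc:
                if "deadlock detected" in str(exc.orig) and attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # simple backoff before retrying
                    continue
                raise
```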
Are the locks on the contributor table due to actual updates to the table, or strictly because of the foreign key?
I am not super familiar with this set of tasks, so I only know a little bit about that operation. To the best of my knowledge, it's just the foreign key that is involved in the update.
If it's just the FK ... have we considered removing the FKs? It's not uncommon to do so in data warehousing applications, partly for this reason (and partly for performance). The tradeoff is that we'd need to be very sure that the workers wouldn't add orphan rows. But it would cure this deadlocking issue, and probably others that are less prominent.
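For concreteness, dropping the FK would look roughly like the sketch below (untested; the constraint is looked up by name because I don't know what it is called in the live schema, and the DSN is a placeholder):

```python
# Illustration of the "remove the FK" idea: find and drop the foreign key(s)
# from issues.cntrb_id to contributors, which is what takes the FOR KEY SHARE
# locks seen in the error above. The workers would then be responsible for
# never inserting cntrb_id values that have no matching contributors row.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # placeholder DSN

with engine.begin() as conn:
    rows = conn.execute(text("""
        SELECT conname
        FROM pg_constraint
        WHERE contype = 'f'
          AND conrelid = 'augur_data.issues'::regclass
          AND confrelid = 'augur_data.contributors'::regclass
    """)).fetchall()
    for (conname,) in rows:
        conn.execute(text(f'ALTER TABLE augur_data.issues DROP CONSTRAINT "{conname}"'))
```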
I haven't heard that suggestion come up yet. That may be a question for @sgoggins or @ABrain7710