awx icon indicating copy to clipboard operation
awx copied to clipboard

[tech-debt][performance] Avoid contextually inappropriate references to UJT.last_job

Open AlanCoding opened this issue 2 years ago • 8 comments

SUMMARY

Related to https://github.com/ansible/awx/issues/13512

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

AlanCoding avatar Feb 10 '23 16:02 AlanCoding

This is very similar in nature to https://github.com/ansible/awx/pull/13124, but attacking a different problem. The shared theme is fighting the entropy of unnecessarily complex and confusing call stacks when saves to jobs and their related (polymorphic) templates.

AlanCoding avatar Feb 10 '23 16:02 AlanCoding

I really like this patch. This would be a meaningful tech debt win, and it's very unlikely that other planned work will overlap with what this is changing because it's at the level of the core job models.

AlanCoding avatar Feb 20 '24 14:02 AlanCoding

Failing CI items, converted to draft

dmzoneill avatar Feb 21 '24 14:02 dmzoneill

I found these logs from integration tests. The unexpected updates to parent instances are a real problem. The deadlock here is extremely concerning, because it happens in canceling/shutdown events and we have a lot of risk there for broader system deadlock. While this PR might need more testing, this is one of the best things we can do to avoid these situations. A canceled job needs to update the job record and exit fast, and this blocks that.


2024-02-22 04:42:58,862 WARNING  [cc896a75fa9245ed93332279170e429b] awx.main.dispatch Sending SIGTERM to task id=6163b146-2eff-4219-8713-6f07f23d6b57, task=awx.main.tasks.jobs.RunJob, args=[2152]
2024-02-22 04:42:58,862 WARNING  [cc896a75fa9245ed93332279170e429b] awx.main.dispatch Received control signal:
{'control': 'cancel', 'task_ids': ['0ef11231-933d-4712-bd19-256db51903f1'], 'reply_to': None}
2024-02-22 04:42:58,863 WARNING  [cc896a75fa9245ed93332279170e429b] awx.main.dispatch Sending SIGTERM to task id=0ef11231-933d-4712-bd19-256db51903f1, task=awx.main.tasks.jobs.RunJob, args=[2153]
2024-02-22 04:42:59,181 ERROR    [6e31ac6539574c2b926687c67802eadc] awx.main.dispatch Worker failed to run task awx.main.tasks.jobs.RunJob(*[2153], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/cursor.py", line 723, in execute
    raise ex.with_traceback(None)
psycopg.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 593369 waits for ShareLock on transaction 138247; blocked by process 593371.
Process 593371 waits for ShareLock on transaction 138252; blocked by process 593369.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2,49) in relation "main_unifiedjobtemplate"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/update_model.py", line 34, in update_model
    instance.save(update_fields=update_fields)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 926, in save
    self._update_parent_instance()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 844, in _update_parent_instance
    parent_instance.save(update_fields=update_fields)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/jobs.py", line 356, in save
    return super(JobTemplate, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 291, in save
    super(UnifiedJobTemplate, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/models.py", line 87, in save
    return super().save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/base.py", line 344, in save
    super(PrimordialModel, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/base.py", line 192, in save
    super(CreatedModifiedModel, self).save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/mixins.py", line 612, in save
    super().save(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 876, in save_base
    parent_inserted = self._save_parents(cls, using, update_fields)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 918, in _save_parents
    updated = self._save_table(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 990, in _save_table
    updated = self._do_update(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 1054, in _do_update
    return filtered._update(values) > 0
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1231, in _update
    return query.get_compiler(self.db).execute_sql(CURSOR)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1984, in execute_sql
    cursor = super().execute_sql(result_type)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1562, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/cursor.py", line 723, in execute
    raise ex.with_traceback(None)
django.db.utils.OperationalError: deadlock detected
DETAIL:  Process 593369 waits for ShareLock on transaction 138247; blocked by process 593371.
Process 593371 waits for ShareLock on transaction 138252; blocked by process 593369.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2,49) in relation "main_unifiedjobtemplate"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 94, in _wrapped
    return f(self, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/signals.py", line 81, in _wrapped
    return f(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 484, in run
    self.instance = self.update_model(pk, status='running', start_args='')  # blank field to remove encrypted passwords
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 121, in update_model
    return update_model(self.model, pk, _attempt=0, _max_attempts=self.update_attempts, **updates)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/update_model.py", line 48, in update_model
    raise RuntimeError(f'Could not fetch {pk} because of receiving abort signal')
RuntimeError: Could not fetch 2153 because of receiving abort signal
2024-02-22 04:42:59,226 WARNING  [8d40d8cf8607417f82a3e0b30e38f019] awx.main.dispatch checking dispatcher cancel for automationcontroller-0
2024-02-22 04:42:59,237 WARNING  [406bfc42b48f43fa8c585bde8fac7bac] awx.main.dispatch Received control signal:
{'control': 'cancel', 'reply_to': 'reply_to_f9550861_5d0a_4ee9_860f_2c99b67be19c', 'task_ids': ['e1e1a67f-f5e2-43c0-9678-711acf552431']}
2024-02-22 04:42:59,244 WARNING  [406bfc42b48f43fa8c585bde8fac7bac] awx.main.dispatch Sending SIGTERM to task id=e1e1a67f-f5e2-43c0-9678-711acf552431, task=awx.main.tasks.jobs.RunAdHocCommand, args=[2169]
2024-02-22 04:42:59,342 INFO     [fc5a4e7c0574436382955763183e9e2b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 2158

AlanCoding avatar Feb 22 '24 14:02 AlanCoding