awx icon indicating copy to clipboard operation
awx copied to clipboard

psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly

Open sli68 opened this issue 1 year ago • 5 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

I user AWX 24.3.1. and external Postgres (postgres (PostgreSQL) 15.2) Every 5 minutes, the awx-web pod log have the following error: 2024-11-13 15:20:41,270 ERROR [-] awx.main.cache_clear Encountered unhandled error in cache clear main loop Traceback (most recent call last): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle for e in conn.events(yield_timeouts=True): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events for notification in notification_generator: File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies raise ex.with_traceback(None) psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Traceback (most recent call last): File "/usr/bin/awx-manage", line 8, in sys.exit(manage()) ^^^^^^^^ File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/init.py", line 177, in manage execute_from_command_line(sys.argv) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/init.py", line 442, in execute_from_command_line utility.execute() File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/init.py", line 436, in execute self.fetch_command(subcommand).run_from_argv(self.argv) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 412, in run_from_argv self.execute(*args, **cmd_options) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 458, in execute output = self.handle(*args, **options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle for e in conn.events(yield_timeouts=True): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events for notification in notification_generator: File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies raise ex.with_traceback(None) psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2024-11-13 15:20:41,888 WARN exited: awx-cache-clear (exit status 1; not expected) 2024-11-13 15:20:41,888 WARN exited: awx-cache-clear (exit status 1; not expected)

AWX version

24.3.1

Select the relevant components

  • [ ] UI
  • [ ] UI (tech preview)
  • [ ] API
  • [ ] Docs
  • [ ] Collection
  • [ ] CLI
  • [ ] Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

deploy awx with external Postgres

Expected results

no error log

Actual results

2024-11-13 15:20:41,270 ERROR [-] awx.main.cache_clear Encountered unhandled error in cache clear main loop Traceback (most recent call last): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle for e in conn.events(yield_timeouts=True): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events for notification in notification_generator: File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies raise ex.with_traceback(None) psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Traceback (most recent call last): File "/usr/bin/awx-manage", line 8, in sys.exit(manage()) ^^^^^^^^ File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/init.py", line 177, in manage execute_from_command_line(sys.argv) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/init.py", line 442, in execute_from_command_line utility.execute() File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/init.py", line 436, in execute self.fetch_command(subcommand).run_from_argv(self.argv) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 412, in run_from_argv self.execute(*args, **cmd_options) File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 458, in execute output = self.handle(*args, **options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle for e in conn.events(yield_timeouts=True): File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events for notification in notification_generator: File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies raise ex.with_traceback(None) psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2024-11-13 15:20:41,888 WARN exited: awx-cache-clear (exit status 1; not expected) 2024-11-13 15:20:41,888 WARN exited: awx-cache-clear (exit status 1; not expected)

Additional information

No response

sli68 avatar Nov 13 '24 15:11 sli68

We see the same thing, running AWX 24.6.1, also using an external Postgres DB (15.8)

We also see the message posted above from awx-web/rsyslog cache clear main loop.

In addition to that we see from awx-web/awx-web rsyslog_configurer main loop (probably the same underlying problem):

2024-12-13 06:37:55,451 ERROR [-] awx.main.rsyslog_configurer Encountered unhandled error in rsyslog_configurer main loop
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_rsyslog_configurer.py", line 29, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: Connection timed out
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
^^^^^^^^
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/__init__.py", line 177, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/__init__.py", line 436, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 412, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 458, in execute
output = self.handle(*args, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_rsyslog_configurer.py", line 29, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: Connection timed out

2and3makes23 avatar Dec 13 '24 06:12 2and3makes23

Possible source of problem

Looking at DB connections from a firewall point of view might reveal the source of the problem here: With our setup idle sessions expire/are closed by firewall after 1h without any traffic.

What seems to happen is, that AWX tries to reuse those already closed connections over and over again. All attempts like this fail like stated above.

What didn´t help in our case was to configure the firewall to send a reset package to DB and AWX when closing "stale" sessions, in hope, AWX would then go ahead and establish a new session when needed next (which it currently just doesn´t do): grafik

Possible fixes

  • some kind of keepalive package (ideally with configurable interval) to prevent the firewall from closing connections. (maybe is there aleady an option in place I´m not aware of?)
  • making AWX create new sessions if last used session is no longer open (if that does not produce too much overhead or concurrency issues).
  • Set a max session age on AWX side, see related issue #13481

Related issues

  • https://github.com/ansible/awx/issues/13481

2and3makes23 avatar Dec 16 '24 11:12 2and3makes23

For now we used a work around* by setting up TCP keepalives on DB side to prevent open sessions from getting stale, which solves the issue for us.

*Would switch to an application side fix via CONN_MAX_AGE or similar, as soon as available, to have everything required to run AWX cleanly in one place.

@ AWX Team Thanks for all your great work on the AWX project! 🎄 🎄 🎄 :-)

2and3makes23 avatar Dec 19 '24 14:12 2and3makes23

Hi @2and3makes23, thank you so much for sharing your work around! We appreciate your time and involvement with the AWX project :D

@sli68 were you able to utilize the workaround @2and3makes23 provided?

djyasin avatar May 29 '25 15:05 djyasin

verified, not work on our side

sli68 avatar Sep 17 '25 15:09 sli68