psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
- [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to
[email protected]instead.)
Bug Summary
I user AWX 24.3.1. and external Postgres (postgres (PostgreSQL) 15.2)
Every 5 minutes, the awx-web pod log have the following error:
2024-11-13 15:20:41,270 ERROR [-] awx.main.cache_clear Encountered unhandled error in cache clear main loop
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in
AWX version
24.3.1
Select the relevant components
- [ ] UI
- [ ] UI (tech preview)
- [ ] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [ ] Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
deploy awx with external Postgres
Expected results
no error log
Actual results
2024-11-13 15:20:41,270 ERROR [-] awx.main.cache_clear Encountered unhandled error in cache clear main loop
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_cache_clear.py", line 24, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/init.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in
Additional information
No response
We see the same thing, running AWX 24.6.1, also using an external Postgres DB (15.8)
We also see the message posted above from awx-web/rsyslog cache clear main loop.
In addition to that we see from awx-web/awx-web rsyslog_configurer main loop (probably the same underlying problem):
2024-12-13 06:37:55,451 ERROR [-] awx.main.rsyslog_configurer Encountered unhandled error in rsyslog_configurer main loop
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_rsyslog_configurer.py", line 29, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: Connection timed out
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
^^^^^^^^
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/__init__.py", line 177, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/__init__.py", line 436, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 412, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/core/management/base.py", line 458, in execute
output = self.handle(*args, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/management/commands/run_rsyslog_configurer.py", line 29, in handle
for e in conn.events(yield_timeouts=True):
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 90, in events
for notification in notification_generator:
File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/__init__.py", line 74, in current_notifies
raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: Connection timed out
Possible source of problem
Looking at DB connections from a firewall point of view might reveal the source of the problem here: With our setup idle sessions expire/are closed by firewall after 1h without any traffic.
What seems to happen is, that AWX tries to reuse those already closed connections over and over again. All attempts like this fail like stated above.
What didn´t help in our case was to configure the firewall to send a reset package to DB and AWX when closing "stale" sessions, in hope, AWX would then go ahead and establish a new session when needed next (which it currently just doesn´t do):
Possible fixes
- some kind of keepalive package (ideally with configurable interval) to prevent the firewall from closing connections. (maybe is there aleady an option in place I´m not aware of?)
- making AWX create new sessions if last used session is no longer open (if that does not produce too much overhead or concurrency issues).
- Set a max session age on AWX side, see related issue #13481
Related issues
- https://github.com/ansible/awx/issues/13481
For now we used a work around* by setting up TCP keepalives on DB side to prevent open sessions from getting stale, which solves the issue for us.
*Would switch to an application side fix via CONN_MAX_AGE or similar, as soon as available, to have everything required to run AWX cleanly in one place.
@ AWX Team Thanks for all your great work on the AWX project! 🎄 🎄 🎄 :-)
Hi @2and3makes23, thank you so much for sharing your work around! We appreciate your time and involvement with the AWX project :D
@sli68 were you able to utilize the workaround @2and3makes23 provided?
verified, not work on our side