awx
awx copied to clipboard
AWX v24.3.0 update breaks deployment (psycopg connection issue?)
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
- [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to
[email protected]
instead.)
Bug Summary
Hello after updating to Awx-operator 2.16.0 with awx image 2.43.0 the Awx-Task pods is not able to start anymore and it throws these errors
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
READY
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,954 INFO [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
2024-04-23 23:36:28,011 INFO [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=38
2024-04-23 23:36:28,023 INFO [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:30,307 INFO [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:30,313 INFO [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
double free or corruption (out)
After downgrading to 2.15 it starts to work again. I have awx deployed on rke2 1.27 stable version. Awx is using an external Postgres 15.1 version
AWX version
2.43.0
Select the relevant components
- [ ] UI
- [ ] UI (tech preview)
- [ ] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [X] Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
Ubuntu 22.04
Web browser
Safari
Steps to reproduce
Upgrade to Awx-oprator 2.16.0 and awx 2.43.0
Expected results
Awx to start correctly
Actual results
Awx-task pod in crash
Additional information
No response
I think it has to do with this bug: https://www.postgresql.org/message-id/[email protected]
After some poking around I was able to find that the segfault occurs when creating the psycopg connection, and confirmed that its a known bug in this stackoverflow: https://stackoverflow.com/questions/77619281/double-free-or-corruption-out-from-psycopg2-connect-using-python-3-11-6
Observed the same
@ilbarone87 @Nyralei thank you for reporting this and the troubleshooting you've done already.
We will look at this as a priority.
If this can help, have digged a little bit more into the issue and postgres was throwing this error:
{
"@timestamp": [
"2024-04-23T21:26:31.527Z"
],
"container": [
"postgres"
],
"message": [
"{\"level\":\"info\",\"ts\":\"2024-04-23T21:26:31Z\",\"logger\":\"postgres\",\"msg\":\"record\",\"logging_pod\":\"postgres-cluster-awx-1\",\"record\":{\"log_time\":\"2024-04-23 21:26:31.526 UTC\",\"process_id\":\"956926\",\"connection_from\":\"10.42.1.189:38644\",\"session_id\":\"66282787.e99fe\",\"session_line_num\":\"1\",\"session_start_time\":\"2024-04-23 21:26:31 UTC\",\"transaction_id\":\"0\",\"error_severity\":\"FATAL\",\"sql_state_code\":\"28000\",\"message\":\"no PostgreSQL user name specified in startup packet\",\"backend_type\":\"not initialized\",\"query_id\":\"0\"}}"
],
"namespace": [
"cnpg-system"
]
Hi folks, So I have reviewed the source on postgresql website for the latest releases, and they are patched in relation to the linked bug. That's said we are still trying to confirm the source of the issue you have indicated.
As for downgrading awx-operator to 2.15 to resolve this issue, I can't see any difference between 2.15 and 2.16 would cause this issue. I suspect we are missing part of the story here. https://gist.github.com/dmzoneill/497746f38c5786c96e8859f1131667af
What version of AWX did you upgrade from?
I'm using
bash-5.1$ rpm -qa | grep post
postgresql-private-libs-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-server-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-contrib-15.2-1.module_el9+264+92dde3f0.x86_64
bash-5.1$ pip3.11 list | grep cop
psycopg 3.1.18
bash-5.1$ python3.11 --version
Python 3.11.7
If you could provide more specific versioning, that may help, thank you
@dmzoneill I think what @ilbarone87 actually means by downgrading to 2.15 version of awx-operator is that by downgrading operator awx image tag changes to 24.2.0 too. So the issue lies in awx 24.3.0 itself and not operator. I observe the same error - "double free or corruption (out)" when changing image_version to 24.3.0 from 24.2.0.
@dmzoneill what @Nyralei is saying is correct, the problem is on the awx image 24.2.0, that's why I posted the issue here and not in the awx-operator repo. The mention of the operator version was just because they are bundled together.
Digging through the diff doesn't indicate a change this would cause this behaviour. If you can continue to provide more info, that would be great. thanks
git diff 24.2.0 24.3.0
We have the same problem with 23.9.0 and the docker-compose setup, re-installed yesterday.
My guess is that is has to do with the base image, where OpenSSL got updated to 3.2.1 :thinking:
This difference?
$ docker run --rm -it quay.io/ansible/awx:24.2.0 dnf list installed openssl*
Installed Packages
openssl.x86_64 1:3.0.7-27.el9 @baseos
openssl-libs.x86_64 1:3.0.7-27.el9 @System
$ docker run --rm -it quay.io/ansible/awx:24.3.0 dnf list installed openssl*
Installed Packages
openssl.x86_64 1:3.2.1-1.el9 @baseos
openssl-libs.x86_64 1:3.2.1-1.el9 @System
quay.io/fosterseth/awx:openssl307
here is an image based on awx v24.3.0 but with a downgraded openssl
if someone wants to test that image out, that would be useful feedback
@fosterseth It works, but now I encounter error with migrations mentioned in issue https://github.com/ansible/awx/issues/15137
@Nyralei thanks for quickly testing that.
I updated that image to pull in Alan's commit, if you want to re-pull and deploy again
@fosterseth Updating to 24.3.1 fixed the issue
Solved for me too. Will close the issue.