awx icon indicating copy to clipboard operation
awx copied to clipboard

AWX v24.3.0 update breaks deployment (psycopg connection issue?)

Open ilbarone87 opened this issue 10 months ago • 13 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Hello after updating to Awx-operator 2.16.0 with awx image 2.43.0 the Awx-Task pods is not able to start anymore and it throws these errors

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
READY
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,954 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
2024-04-23 23:36:28,011 INFO     [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=38
2024-04-23 23:36:28,023 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:30,307 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:30,313 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
double free or corruption (out)

After downgrading to 2.15 it starts to work again. I have awx deployed on rke2 1.27 stable version. Awx is using an external Postgres 15.1 version

AWX version

2.43.0

Select the relevant components

  • [ ] UI
  • [ ] UI (tech preview)
  • [ ] API
  • [ ] Docs
  • [ ] Collection
  • [ ] CLI
  • [X] Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Ubuntu 22.04

Web browser

Safari

Steps to reproduce

Upgrade to Awx-oprator 2.16.0 and awx 2.43.0

Expected results

Awx to start correctly

Actual results

Awx-task pod in crash

Additional information

No response

ilbarone87 avatar Apr 23 '24 23:04 ilbarone87

I think it has to do with this bug: https://www.postgresql.org/message-id/[email protected]

After some poking around I was able to find that the segfault occurs when creating the psycopg connection, and confirmed that its a known bug in this stackoverflow: https://stackoverflow.com/questions/77619281/double-free-or-corruption-out-from-psycopg2-connect-using-python-3-11-6

ryantking avatar Apr 24 '24 01:04 ryantking

Observed the same

Nyralei avatar Apr 24 '24 05:04 Nyralei

@ilbarone87 @Nyralei thank you for reporting this and the troubleshooting you've done already.

We will look at this as a priority.

gundalow avatar Apr 24 '24 06:04 gundalow

If this can help, have digged a little bit more into the issue and postgres was throwing this error:

{
  "@timestamp": [
    "2024-04-23T21:26:31.527Z"
  ],
  "container": [
    "postgres"
  ],
  "message": [
    "{\"level\":\"info\",\"ts\":\"2024-04-23T21:26:31Z\",\"logger\":\"postgres\",\"msg\":\"record\",\"logging_pod\":\"postgres-cluster-awx-1\",\"record\":{\"log_time\":\"2024-04-23 21:26:31.526 UTC\",\"process_id\":\"956926\",\"connection_from\":\"10.42.1.189:38644\",\"session_id\":\"66282787.e99fe\",\"session_line_num\":\"1\",\"session_start_time\":\"2024-04-23 21:26:31 UTC\",\"transaction_id\":\"0\",\"error_severity\":\"FATAL\",\"sql_state_code\":\"28000\",\"message\":\"no PostgreSQL user name specified in startup packet\",\"backend_type\":\"not initialized\",\"query_id\":\"0\"}}"
  ],
  "namespace": [
    "cnpg-system"
  ]

ilbarone87 avatar Apr 24 '24 12:04 ilbarone87

Hi folks, So I have reviewed the source on postgresql website for the latest releases, and they are patched in relation to the linked bug. That's said we are still trying to confirm the source of the issue you have indicated.

As for downgrading awx-operator to 2.15 to resolve this issue, I can't see any difference between 2.15 and 2.16 would cause this issue. I suspect we are missing part of the story here. https://gist.github.com/dmzoneill/497746f38c5786c96e8859f1131667af

What version of AWX did you upgrade from?

I'm using

bash-5.1$ rpm -qa | grep post
postgresql-private-libs-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-server-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-contrib-15.2-1.module_el9+264+92dde3f0.x86_64
bash-5.1$ pip3.11 list | grep cop
psycopg                     3.1.18
bash-5.1$ python3.11 --version
Python 3.11.7

If you could provide more specific versioning, that may help, thank you

dmzoneill avatar Apr 24 '24 14:04 dmzoneill

@dmzoneill I think what @ilbarone87 actually means by downgrading to 2.15 version of awx-operator is that by downgrading operator awx image tag changes to 24.2.0 too. So the issue lies in awx 24.3.0 itself and not operator. I observe the same error - "double free or corruption (out)" when changing image_version to 24.3.0 from 24.2.0.

Nyralei avatar Apr 24 '24 14:04 Nyralei

@dmzoneill what @Nyralei is saying is correct, the problem is on the awx image 24.2.0, that's why I posted the issue here and not in the awx-operator repo. The mention of the operator version was just because they are bundled together.

ilbarone87 avatar Apr 24 '24 15:04 ilbarone87

Digging through the diff doesn't indicate a change this would cause this behaviour. If you can continue to provide more info, that would be great. thanks

git diff 24.2.0 24.3.0

dmzoneill avatar Apr 24 '24 15:04 dmzoneill

We have the same problem with 23.9.0 and the docker-compose setup, re-installed yesterday.

My guess is that is has to do with the base image, where OpenSSL got updated to 3.2.1 :thinking:

mensler avatar Apr 24 '24 15:04 mensler

This difference?

$ docker run --rm -it quay.io/ansible/awx:24.2.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.0.7-27.el9                        @baseos
openssl-libs.x86_64                        1:3.0.7-27.el9                        @System

$ docker run --rm -it quay.io/ansible/awx:24.3.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.2.1-1.el9                        @baseos
openssl-libs.x86_64                        1:3.2.1-1.el9                        @System

kurokobo avatar Apr 24 '24 15:04 kurokobo

quay.io/fosterseth/awx:openssl307

here is an image based on awx v24.3.0 but with a downgraded openssl

if someone wants to test that image out, that would be useful feedback

fosterseth avatar Apr 24 '24 18:04 fosterseth

@fosterseth It works, but now I encounter error with migrations mentioned in issue https://github.com/ansible/awx/issues/15137

Nyralei avatar Apr 24 '24 19:04 Nyralei

@Nyralei thanks for quickly testing that.

I updated that image to pull in Alan's commit, if you want to re-pull and deploy again

fosterseth avatar Apr 24 '24 21:04 fosterseth

@fosterseth Updating to 24.3.1 fixed the issue

Nyralei avatar May 02 '24 08:05 Nyralei

Solved for me too. Will close the issue.

ilbarone87 avatar May 02 '24 08:05 ilbarone87