WORKER TIMEOUT in logs, st2 not working
Discussed in https://github.com/StackStorm/st2/discussions/5422
Originally posted by amyasnikov on November 1, 2021.
I'm trying to build an HA st2 setup without k8s. My setup:
- 3 "cluster" nodes with rabbitmq, redis and mongodb on board
- 2 app nodes with st2 components: one node is active (st2 services running) and one node is on standby (st2 services stopped). The relevant st2.conf sections are sketched below; the full file is attached at the bottom.
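For reference, the relevant parts of st2.conf on both app nodes look roughly like this. This is only a sketch: the section and option names are the standard st2.conf ones, but the hostnames, credentials and replica-set name are placeholders; the real file is attached at the bottom of this issue.

```ini
[database]
# MongoDB on the three cluster nodes (placeholder hostnames / replica set name)
host = mongodb://cluster1,cluster2,cluster3/st2?replicaSet=rs0

[messaging]
# RabbitMQ on the cluster nodes (placeholder credentials)
url = amqp://st2:password@cluster1:5672

[coordination]
# Redis is used as the coordination backend
url = redis://cluster1:6379
```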
The first app node works well (st2 login, st2 run, st2 execution get: all commands work fine). The problem arises on the second app node when I try to make it active. I stop st2 on the first node (st2ctl stop) and then start st2 on the second node (st2ctl start). On the second node any st2 command (st2 login, st2 execution list, or any other) hangs indefinitely without producing any output. "st2ctl reload" on the second node completes successfully, but st2 commands still hang. Meanwhile, in the logs (st2api.log and st2auth.log) I see messages like this:
[2021-11-01 10:59:07 +0000] [9253] [CRITICAL] WORKER TIMEOUT (pid:9496)
If I switch back to the first node, everything works fine again. st2.conf is the same on both app nodes.
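To narrow this down I plan to check whether the API processes answer locally at all and what the timing-out gunicorn worker is actually blocked on. A rough sketch (assuming the default st2api/st2auth ports 9101/9100; the worker pid comes from the WORKER TIMEOUT line above):

```shell
# On the newly active app node: do st2api/st2auth answer locally at all?
curl -sv --max-time 10 http://127.0.0.1:9101/ -o /dev/null   # st2api (default port)
curl -sv --max-time 10 http://127.0.0.1:9100/ -o /dev/null   # st2auth (default port)

# What is the timing-out worker doing before the arbiter kills it?
# 9496 is the pid from the WORKER TIMEOUT message; adjust to the current one.
sudo timeout 20 strace -f -p 9496 -e trace=network -o /tmp/st2-worker.strace
less /tmp/st2-worker.strace
```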
- The network and DNS are fine. I can telnet to all the services (rabbit, mongo, redis) from both app nodes; see the check sketched after this list. I also tried mongosh from the app node using the connection string from st2.conf, and it works well (I was able to connect and read the st2 database collections).
- datastore_key.json is the same on both nodes
- st2admin password is the same on both nodes
- st2 is installed on CentOS 7 via https://docs.stackstorm.com/install/rhel7.html
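The checks above boil down to roughly the following (cluster1..3 are placeholders for my cluster nodes; the st2.conf and datastore key paths are the defaults from the RHEL7 install guide):

```shell
# Run from the standby app node: can it reach rabbitmq, redis and mongodb?
for host in cluster1 cluster2 cluster3; do
  for port in 5672 6379 27017; do   # rabbitmq, redis, mongodb
    timeout 3 bash -c "</dev/tcp/$host/$port" \
      && echo "$host:$port reachable" || echo "$host:$port FAILED"
  done
done

# Confirm the config and datastore key really are identical on both app nodes.
md5sum /etc/st2/st2.conf /etc/st2/keys/datastore_key.json
```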
What could be the reason for this behavior?
PS: While preparing this text I noticed another strange message in st2auth.log which hadn't appeared before:
[2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request.
Traceback (most recent call last):
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle
util.reraise(*sys.exc_info())
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise
raise value
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle
listener_name = listener.getsockname()
OSError: [Errno 9] Bad file descriptor
[2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request.
Traceback (most recent call last):
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle
util.reraise(*sys.exc_info())
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise
raise value
File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle
listener_name = listener.getsockname()
OSError: [Errno 9] Bad file descriptor
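To see whether these socket errors simply line up with the worker restarts triggered by the WORKER TIMEOUT, I can correlate the timestamps, for example (default CentOS 7 unit names and log paths):

```shell
# Correlate socket errors with gunicorn worker restarts around the failover time.
sudo journalctl -u st2auth -u st2api --since "2021-11-01 10:50" --no-pager
grep -E "WORKER TIMEOUT|Booting worker|Bad file descriptor" /var/log/st2/st2auth.log
```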
software versions:
$ st2 --version
st2 3.5.0, on Python 3.6.8
$ uname -a
Linux node10 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Logs and st2.conf: [st2auth.log](https://github.com/StackStorm/st2/files/7453222/st2auth.log), [st2.conf.txt](https://github.com/StackStorm/st2/files/7453224/st2.conf.txt)