
WORKER TIMEOUT in logs, st2 not working

Open · amyasnikov opened this issue 4 years ago · 1 comment

Discussed in https://github.com/StackStorm/st2/discussions/5422

Originally posted by amyasnikov, November 1, 2021: I'm trying to set up an HA st2 deployment without k8s. My setup (the relevant st2.conf sections are sketched after this list):

  • 3 "cluster" nodes with rabbitmq, redis and mongodb on board
  • 2 app nodes with st2 components. 1 node is active (st2 is in running state) and 1 node in standby (st2 is in stopped state).
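
For reference, the HA-relevant sections of st2.conf take roughly this shape (the hostnames, credentials, and replica-set name below are placeholders, not the real values; the actual file is attached at the end of this post):

[messaging]
url = amqp://st2user:st2pass@cluster-node1:5672
cluster_urls = amqp://st2user:st2pass@cluster-node2:5672, amqp://st2user:st2pass@cluster-node3:5672

[database]
host = mongodb://cluster-node1,cluster-node2,cluster-node3/?replicaSet=rs0

[coordination]
url = redis://cluster-node1:6379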

The first app node works well: st2 login, st2 run, st2 execution get, and every other command work fine. The problem arises on the second app node when I try to make it active. I stop st2 on the first node (st2ctl stop) and then start st2 on the second node (st2ctl start). On the second node, any st2 command (st2 login, st2 execution list, or anything else) hangs indefinitely with no output. st2ctl reload on the second node completes successfully, but st2 commands still hang. Meanwhile, the logs (st2api.log and st2auth.log) show messages like this:
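
To isolate which service hangs, each one can be probed directly on the standby node: a fast HTTP error code means the gunicorn worker is alive, while a stall of roughly 30 seconds followed by nothing matches the WORKER TIMEOUT below. This sketch assumes the default bind addresses from the RHEL7 packages (st2auth on 127.0.0.1:9100, st2api on 127.0.0.1:9101); adjust if st2.conf says otherwise:

$ timeout 35 curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9100/   # st2auth
$ timeout 35 curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9101/   # st2api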

[2021-11-01 10:59:07 +0000] [9253] [CRITICAL] WORKER TIMEOUT (pid:9496)
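
That line comes from the gunicorn arbiter: a worker that fails to notify the master within --timeout seconds (30 by default) is killed and replaced, which fits the picture of client commands hanging and never completing. A quick way to check how the workers are launched and to watch them being recycled (unit names as shipped by the el7 packages; a sketch, not verified against every st2 version):

$ systemctl cat st2api st2auth | grep -n 'gunicorn\|timeout'
$ journalctl -u st2api -f    # watch for WORKER TIMEOUT / respawn while reproducing the hang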

If I switch back to the first node, everything works fine again. st2.conf is identical on both app nodes.

  • The network and DNS are fine. I can telnet to all the services (rabbit, mongo, redis) from both app nodes. I also tried mongosh from an app node using the connection string from st2.conf, and it works well (I could connect and read the st2 database collections). Redis and rabbitmq can be checked end-to-end the same way; see the sketch after this list.
  • datastore_key.json is the same on both nodes
  • The st2admin password is the same on both nodes
  • st2 was installed on CentOS 7 via https://docs.stackstorm.com/install/rhel7.html
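
Since a plain TCP connect can succeed even when a backend never answers, redis and rabbitmq can be exercised end-to-end the same way mongosh was; a coordination (redis) or messaging (rabbitmq) call that blocks would stall the gunicorn workers despite the sockets opening. A sketch with placeholder hostnames and credentials:

$ timeout 5 redis-cli -h cluster-node1 ping                 # expect PONG
$ timeout 5 curl -sS -u st2user:st2pass http://cluster-node1:15672/api/overview   # only if the RabbitMQ management plugin is enabled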

What could be the reason for this behavior?

PS: While preparing this text, I noticed another strange message in st2auth.log that hadn't appeared previously:

[2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle
    util.reraise(*sys.exc_info())
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise
    raise value
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle
    listener_name = listener.getsockname()
OSError: [Errno 9] Bad file descriptor
[2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle
    util.reraise(*sys.exc_info())
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise
    raise value
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle
    listener_name = listener.getsockname()
OSError: [Errno 9] Bad file descriptor

software versions:

$ st2 --version
st2 3.5.0, on Python 3.6.8
$ uname -a
Linux node10 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Logs and st2.conf: [st2auth.log](https://github.com/StackStorm/st2/files/7453222/st2auth.log), [st2.conf.txt](https://github.com/StackStorm/st2/files/7453224/st2.conf.txt)

amyasnikov · Nov 09 '21

Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking it as stale. If this issue is no longer relevant or applicable (the problem has been fixed in a new version, or similar), please close the issue or let us know so we can close it. Conversely, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context that would help us when working on this issue, please include them as a comment.

stale[bot] · Apr 16 '22