st2-docker
ServerSelectionTimeoutError when services connect to mongo.
Problem
I tried to install StackStorm on Docker using this repository's guide, without making any changes to the files.
After executing docker-compose, every service that depends on the mongo service and tries to connect to its database throws a ServerSelectionTimeoutError:
st2timersengine_1 | 2022-03-22 18:19:08,934 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 623a13191966d2953888ab75, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2timersengine_1 | 2022-03-22 18:19:08,934 ERROR [-] (PID=1) TimerEngine quit due to exception.
st2timersengine_1 | Traceback (most recent call last):
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2reactor/cmd/timersengine.py", line 93, in main
st2timersengine_1 | _setup()
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2reactor/cmd/timersengine.py", line 50, in _setup
st2timersengine_1 | capabilities=capabilities,
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/service_setup.py", line 249, in setup
st2timersengine_1 | db_setup()
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/database_setup.py", line 55, in db_setup
st2timersengine_1 | connection = db_init.db_setup_with_retry(**db_cfg)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/persistence/db_init.py", line 93, in db_setup_with_retry
st2timersengine_1 | ssl_match_hostname=ssl_match_hostname,
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/persistence/db_init.py", line 58, in db_func_with_retry
st2timersengine_1 | return retrying_obj.call(db_func, *args, **kwargs)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 206, in call
st2timersengine_1 | return attempt.get(self._wrap_exception)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 247, in get
st2timersengine_1 | six.reraise(self.value[0], self.value[1], self.value[2])
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/six.py", line 696, in reraise
st2timersengine_1 | raise value
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 200, in call
st2timersengine_1 | attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 251, in db_setup
st2timersengine_1 | ssl_match_hostname=ssl_match_hostname,
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 213, in _db_connect
st2timersengine_1 | raise e
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 204, in _db_connect
st2timersengine_1 | connection.admin.command("ping")
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/database.py", line 737, in command
st2timersengine_1 | read_preference, session) as (sock_info, slave_ok):
st2timersengine_1 | File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
st2timersengine_1 | return next(self.gen)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1325, in _socket_for_reads
st2timersengine_1 | server = self._select_server(read_preference, session)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
st2timersengine_1 | server = topology.select_server(server_selector)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 243, in select_server
st2timersengine_1 | address))
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 200, in select_servers
st2timersengine_1 | selector, server_timeout, address)
st2timersengine_1 | File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
st2timersengine_1 | (self._error_message(selector), timeout, self.description))
st2timersengine_1 | pymongo.errors.ServerSelectionTimeoutError: No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 623a13191966d2953888ab75, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
After doing some digging, the source of this issue is the ServerSelectionTimeoutError exception that is caught at line 205 of this file: https://github.com/StackStorm/st2/blob/master/st2common/st2common/models/db/__init__.py.
Some services keep restarting indefinitely due to the restart: on-failure option, and only st2-docker_st2web_1 and st2-docker_st2client_1 stay up, but they are unusable.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5cab2898f2c9 stackstorm/st2actionrunner:latest "/st2client-startup.…" 15 minutes ago Up 2 minutes st2-docker_st2client_1
a265df0d2466 stackstorm/st2chatops:latest "/st2chatops-startup…" 15 minutes ago Exited (0) 2 minutes ago st2-docker_st2chatops_1
299d5ab21193 stackstorm/st2web:latest "/bin/bash -c 'if [ …" 15 minutes ago Up 2 minutes (unhealthy) 127.0.0.1:80->80/tcp, 443/tcp st2-docker_st2web_1
b47e4c381ba9 stackstorm/st2timersengine:latest "/opt/stackstorm/st2…" 15 minutes ago Up 5 seconds st2-docker_st2timersengine_1
f424783795a7 stackstorm/st2actionrunner:latest "/opt/stackstorm/st2…" 15 minutes ago Restarting (1) 5 seconds ago st2-docker_st2actionrunner_1
e584a6a1dee6 stackstorm/st2stream:latest "/opt/stackstorm/st2…" 15 minutes ago Up 1 second 9102/tcp st2-docker_st2stream_1
75ae17a1eabf stackstorm/st2auth:latest "/opt/stackstorm/st2…" 15 minutes ago Up 9 seconds 9100/tcp st2-docker_st2auth_1
fe556f0a4bd6 stackstorm/st2workflowengine:latest "/opt/stackstorm/st2…" 15 minutes ago Restarting (1) 4 seconds ago st2-docker_st2workflowengine_1
e0e6f4ad0aa9 stackstorm/st2sensorcontainer:latest "/opt/stackstorm/st2…" 15 minutes ago Up 7 seconds st2-docker_st2sensorcontainer_1
0676f3a3b7c1 stackstorm/st2garbagecollector:latest "/opt/stackstorm/st2…" 15 minutes ago Restarting (1) 4 seconds ago st2-docker_st2garbagecollector_1
fec487dd323b stackstorm/st2scheduler:latest "/opt/stackstorm/st2…" 15 minutes ago Up 12 seconds st2-docker_st2scheduler_1
ece3af683ef5 stackstorm/st2notifier:latest "/opt/stackstorm/st2…" 15 minutes ago Up 20 seconds st2-docker_st2notifier_1
2e523c091513 stackstorm/st2rulesengine:latest "/opt/stackstorm/st2…" 15 minutes ago Up 22 seconds st2-docker_st2rulesengine_1
ad9b149797e6 stackstorm/st2api:latest "/opt/stackstorm/st2…" 15 minutes ago Up 5 seconds 9101/tcp st2-docker_st2api_1
97329d112d76 stackstorm/st2actionrunner:latest "/makesecrets.sh" 15 minutes ago Exited (0) 2 minutes ago st2-docker_st2makesecrets_1
01453a00d51b mongo:4.0 "docker-entrypoint.s…" 15 minutes ago Up 2 minutes 27017/tcp st2-docker_mongo_1
a971e05ba16d rabbitmq:3.8 "docker-entrypoint.s…" 15 minutes ago Up 2 minutes 4369/tcp, 5671-5672/tcp, 15691-15692/tcp, 25672/tcp st2-docker_rabbitmq_1
a8a72c873785 redis:6.2 "docker-entrypoint.s…" 15 minutes ago Up 2 minutes 6379/tcp st2-docker_redis_1
Has anyone else encountered this error?
Versions
- Host OS: Debian GNU/Linux 11 (bullseye)
- docker: 20.10.5+dfsg1
- docker-compose: 1.25.0
- stackstorm version: v3.6.0
How to reproduce
Follow the README of this repository.
I got this problem too. I haven't solved it.
Facing the same issue here. Unable to find a solution.
Going back to the previous version (st2-docker 3.6) can solve it.
Same issue while trying to start st2 3.7.0 docker images, running on:
[root@lp-bejodeve-ubu st2-docker]# hostnamectl
Static hostname: localhost.localdomain
Transient hostname: URL
Icon name: computer-vm
Chassis: vm
Virtualization: vmware
Operating System: Rocky Linux 8.6 (Green Obsidian)
CPE OS Name: cpe:/o:rocky:rocky:8:GA
Kernel: Linux 4.18.0-372.9.1.el8.x86_64
Architecture: x86-64
[root@lp-bejodeve-ubu st2-docker]# docker --version
Docker version 20.10.17, build 100c701
[root@lp-bejodeve-ubu st2-docker]# docker-compose --version
Docker Compose version v2.6.1
[root@lp-bejodeve-ubu st2-docker]#
However, in my case going back to 3.6.0 did not help either; running these containers produces the same issue:
[root@lp-bejodeve-ubu st2-docker]# docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
58070676a063 stackstorm/st2web:3.6.0 "/bin/bash -c 'if [ …" 3 minutes ago Up 2 minutes (unhealthy) 443/tcp, 0.0.0.0:8899->80/tcp st2-docker-st2web-1
f82ddfd03b68 stackstorm/st2actionrunner:3.6.0 "/st2client-startup.…" 3 minutes ago Up 2 minutes st2-docker-st2client-1
5c0724094489 stackstorm/st2rulesengine:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 10 seconds st2-docker-st2rulesengine-1
598c8288b678 stackstorm/st2notifier:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Restarting (1) 5 seconds ago st2-docker-st2notifier-1
1d1495532504 stackstorm/st2sensorcontainer:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 11 seconds st2-docker-st2sensorcontainer-1
b016fbfcd6b1 stackstorm/st2workflowengine:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 10 seconds st2-docker-st2workflowengine-1
2e0f0bb87a14 stackstorm/st2actionrunner:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Restarting (1) 7 seconds ago st2-docker-st2actionrunner-1
f892a9365e94 stackstorm/st2scheduler:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 1 second st2-docker-st2scheduler-1
43985cf22e90 stackstorm/st2timersengine:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 11 seconds st2-docker-st2timersengine-1
a220342568ff stackstorm/st2auth:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up Less than a second 9100/tcp st2-docker-st2auth-1
122eafc2e421 stackstorm/st2garbagecollector:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 9 seconds st2-docker-st2garbagecollector-1
d7c48abdd377 stackstorm/st2stream:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Restarting (1) 6 seconds ago st2-docker-st2stream-1
a2fb891fede2 stackstorm/st2api:3.6.0 "/opt/stackstorm/st2…" 3 minutes ago Up 1 second 9101/tcp st2-docker-st2api-1
a8433cc07ddb mongo:4.4 "docker-entrypoint.s…" 3 minutes ago Up 3 minutes 27017/tcp st2-docker-mongo-1
aae9d96b8915 rabbitmq:3.8 "docker-entrypoint.s…" 3 minutes ago Up 3 minutes 4369/tcp, 5671-5672/tcp, 15691-15692/tcp, 25672/tcp st2-docker-rabbitmq-1
5b57bd0e2367 redis:6.2 "docker-entrypoint.s…" 3 minutes ago Up 3 minutes 6379/tcp st2-docker-redis-1
Attached logs from containers. st2 docker logs.txt
I would check whether the mongo host (container) is reachable from the st2 containers.
If yes, I'd try to connect to it with the mongo client to understand whether the DB is working.
If the mongo client can connect to the DB while st2 cannot, it's a StackStorm problem. If the mongo client cannot connect to the DB, it's a MongoDB or Docker problem.
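For what it's worth, a quick check along those lines could look like the sketch below (my own example, not something from this repo). It assumes you exec into one of the st2 containers, use the Python interpreter from the st2 virtualenv (which already ships pymongo), and that the compose defaults of host mongo and port 27017 apply.
# mongoping.py - quick reachability and MongoDB ping check (hypothetical helper)
import socket
import pymongo
# 1. Raw TCP check: rules out DNS resolution and basic networking problems.
with socket.create_connection(("mongo", 27017), timeout=3):
    print("TCP connection to mongo:27017 succeeded")
# 2. MongoDB-level check: the "ping" command is cheap and needs no auth.
client = pymongo.MongoClient("mongo", 27017, serverSelectionTimeoutMS=3000)
client.admin.command("ping")
print("MongoDB responded to ping")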
I've actually spent a lot of time back in September trying to solve this issue, without any definitive answers. On Ubuntu VMs I encounter the issue, but in WSL2 running Rancher Desktop there are no issues. Between me and my team, we were able to replicate it on Ubuntu 20.04 and 22.04. From my notes, the issue presents sometime between dockerd 20.10.13 and 20.10.17. Generally, I have to reboot after downgrading to resolve it, and if I upgrade after it works once, I'm good. But on a fresh install, there are problems.
Ultimately, the issue presents as a network race condition. It's not a DNS timeout; that has a different error. I also tried adding a script that sleeps before starting things like st2actionrunner, to try to give the network time to establish between the st2actionrunner container and mongo.
Here's what I know:
When the issue occurs, the error message is relatively vague:
st2api_1 | 2022-09-13 13:15:48,108 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1 | 2022-09-13 13:15:51,121 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 63208284d0b0393f8dd98335, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
If you manage to shell into the container before it exits, you'll find that the mongo instance is reachable. I had a mongocheck.py script that I would mount in the container and run before the container was restarted.
Here is the script. In it, I've included (as comments) the errors you see if you have bad DNS or the port is unreachable.
import pymongo
# >>> client = pymongo.MongoClient("afakehost", 27017) # invalid host
# >>> db = client.st2
# >>> db.my_collection.insert_one({"x": 10}).inserted_id
# pymongo.errors.ServerSelectionTimeoutError: afakehost:27017: [Errno -3] Temporary failure in name resolution, Timeout: 30s, Topology Description: <TopologyDescription id: 6308e86ed939509aa52e4eeb, topology_type: Single, servers: [<ServerDescription ('afakehost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('afakehost:27017: [Errno -3] Temporary failure in name resolution')>]>
# >>> client = pymongo.MongoClient("mongo", 27019) # invalid port
# >>> db = client.st2
# >>> db.my_collection.insert_one({"x": 10}).inserted_id
mongohost = "mongo"
mongoport = 27017
print(f"Connecting to {mongohost}:{mongoport}")
client = pymongo.MongoClient(mongohost, mongoport)
db = client.st2
inserted = db.my_collection.insert_one({"x": 10}).inserted_id
print(f"inserted id of {inserted}")
Here, 20.10.13 of the docker-ce package had it working for me. I had downgraded from 20.10.17, and my docker client is still on that version.
hogenj@devg:~/projects/st2-docker$ apt-cache madison docker-ce
docker-ce | 5:20.10.17~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
docker-ce | 5:20.10.16~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
docker-ce | 5:20.10.15~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
docker-ce | 5:20.10.14~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
docker-ce | 5:20.10.13~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
sudo apt install docker-ce=5:20.10.13~3-0~ubuntu-jammy
hogenj@devg:~/projects/st2-docker$ docker --version
Docker version 20.10.17, build 100c701
hogenj@devg:~/projects/st2-docker$ docker-compose --version
docker-compose version 1.29.2, build unknown
hogenj@devg:~/projects/st2-docker$ dockerd --version
Docker version 20.10.13, build 906f57f
Since it's been a while and there are newer versions of docker-ce out, I'm going to try upgrading my dev vm and see where things stand.
After updates
Docker version 20.10.21, build baeda1f
Docker version 20.10.21, build 3056208
docker-compose version 1.29.2, build unknown
I made some changes to st2api to try to work around any race conditions:
st2api:
image: ${ST2_IMAGE_REPO:-stackstorm/}st2api:${ST2_VERSION:-latest}
...
command: /startst2api.sh
# command: /mongocheck2.py
volumes:
- ./scripts/st2client-startup.sh:/st2client-startup.sh:ro
- ./files/mongocheck.py:/mongocheck.py:rw
- ./files/mongocheck2.py:/mongocheck2.py:rw
- ./files/startst2api.sh:/startst2api.sh:rw
...
startst2api.sh is:
#!/bin/bash
echo "sleeping 30 seconds"
sleep 30
echo "starting st2api"
/opt/stackstorm/st2/bin/st2api --config-file=/etc/st2/st2.conf --config-file=/etc/st2/st2.docker.conf --config-file=/etc/st2/st2.user.conf
If I start that, I get the vague server selection timeout:
hogenj@devg:~/projects/st2-docker$ docker-compose up st2api
st2-docker_mongo_1 is up-to-date
st2-docker_redis_1 is up-to-date
st2-docker_rabbitmq_1 is up-to-date
Starting st2-docker_st2makesecrets_1 ... done
Recreating st2-docker_st2api_1 ... done
Attaching to st2-docker_st2api_1
st2api_1 | sleeping 30 seconds
st2api_1 | starting st2api
st2api_1 | 2022-11-21 15:08:05,436 INFO [-] Using Python: 3.8.10 (/opt/stackstorm/st2/bin/python)
st2api_1 | 2022-11-21 15:08:05,436 INFO [-] Using fs encoding: utf-8, default encoding: utf-8, locale: en_US.UTF-8, LANG env variable: en_US.UTF-8, PYTHONIOENCODING env variable: notset
st2api_1 | 2022-11-21 15:08:05,436 INFO [-] Using config files: /etc/st2/st2.conf,/etc/st2/st2.docker.conf,/etc/st2/st2.user.conf
st2api_1 | 2022-11-21 15:08:05,437 INFO [-] Using logging config: /etc/st2/logging.api.gunicorn.conf
st2api_1 | 2022-11-21 15:08:05,437 INFO [-] Using coordination driver: redis
st2api_1 | 2022-11-21 15:08:05,437 INFO [-] Using metrics driver: noop
st2api_1 | 2022-11-21 15:08:05,461 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1 | 2022-11-21 15:08:08,484 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 637b945500772fa92caefd9a, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
(snip)
however...
If I more or less recreate the code, even without a sleep, it works.
files/mongocheck2.py
#!/opt/stackstorm/st2/bin/python
import pymongo
import mongoengine
from pymongo.errors import OperationFailure
from pymongo.errors import ConnectionFailure
from pymongo.errors import ServerSelectionTimeoutError
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
from oslo_config import cfg
# import st2common.config as common_config
from st2api import config
config.register_opts(ignore_errors=True)
import st2common.models.db as db
db.LOG.setLevel(logging.DEBUG)
db_host = "mongo"
db_port = 27017
db_name = "st2"
username = None
password = None
connection_timeout = 3000 # ms
ssl_kwargs = {}
compressor_kwargs = {}
print('running _db_connect')
connection = db._db_connect(db_name, db_host, db_port)
# NOTE: We intentionally set "serverSelectionTimeoutMS" to 3 seconds. By default it's set to
# 30 seconds, which means it will block up to 30 seconds and fail if there are any SSL related
# or other errors
connection_timeout = cfg.CONF.database.connection_timeout
connection = mongoengine.connection.connect(
db_name,
host=db_host,
port=db_port,
tz_aware=True,
# alias='foo',
username=username,
password=password,
connectTimeoutMS=connection_timeout,
serverSelectionTimeoutMS=connection_timeout,
**ssl_kwargs,
**compressor_kwargs,
)
# NOTE: Since pymongo 3.0, connect() method is lazy and not blocking (always returns success)
# so we need to issue a command / query to check if connection has been
# successfully established.
# See http://api.mongodb.com/python/current/api/pymongo/mongo_client.html for details
try:
# The ping command is cheap and does not require auth
# https://www.mongodb.com/community/forums/t/how-to-use-the-new-hello-interface-for-availability/116748/
connection.admin.command("ping")
except (ConnectionFailure, ServerSelectionTimeoutError) as e:
# NOTE: ServerSelectionTimeoutError can also be thrown if SSLHandShake fails in the server
# Sadly the client doesn't include more information about the error so in such scenarios
# user needs to check MongoDB server log
print(f'Failed to connect to database "{db_name}" @ "{db_host}:{db_port}" as user "{username}".')
raise e
print(f'Successfully connected to database "{db_name}" @ "{db_host}:{db_port}" as user "{username}".')
# connection.close()
and I set that as the command in st2api
command: /opt/stackstorm/st2/bin/python /mongocheck2.py
volumes:
- ./files/mongocheck2.py:/mongocheck2.py:rw
hogenj@devg:~/projects/st2-docker$ docker-compose up st2api
Starting st2-docker_st2makesecrets_1 ...
st2-docker_mongo_1 is up-to-date
st2-docker_rabbitmq_1 is up-to-date
Starting st2-docker_st2makesecrets_1 ... done
Starting st2-docker_st2api_1 ... done
Attaching to st2-docker_st2api_1
st2api_1 | running _db_connect
st2api_1 | Successfully connected to database "st2" @ "mongo:27017" as user "None".
st2api_1 | Inserting one using pymongo
st2api_1 | inserted id of 637ba0ecbdf365d52a98264e
st2-docker_st2api_1 exited with code 0
And because this is already a wall of text, I'm providing mongocheck2.py as a gist. It's basically a copy of the st2common db connection.
One interesting note which might not be related: after running mongocheck2.py, I started getting an error about not running disconnect first. That's why, in mongocheck2.py, I added a line with alias='foo' (a sketch of the usual ways around this follows the traceback below).
st2api_1 | running _db_connect
st2api_1 | Traceback (most recent call last):
st2api_1 | File "/mongocheck2.py", line 40, in <module>
st2api_1 | connection = mongoengine.connection.connect(
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/connection.py", line 395, in connect
st2api_1 | raise ConnectionFailure(err_msg)
st2api_1 | mongoengine.connection.ConnectionFailure: A different connection with alias `default` was already registered. Use disconnect() first
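For reference, the usual ways around that mongoengine error look roughly like the sketch below. These are standard mongoengine calls (disconnect() is available in recent mongoengine versions), nothing st2-specific, and the db/host/port values are just the compose defaults.
import mongoengine
# Option 1: drop the already-registered default connection before reconnecting.
mongoengine.disconnect(alias="default")
mongoengine.connect("st2", host="mongo", port=27017)
# Option 2: register the second connection under its own alias,
# which is what the alias='foo' line in mongocheck2.py does.
mongoengine.connect("st2", host="mongo", port=27017, alias="check")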
First, it's OK when StackStorm tries to connect to MongoDB that is not yet available and fails; it will keep retrying until a successful connection is established.
The StackStorm services can work in 2 different ways when it comes to the DB backend connection:
- retry multiple times until the MongoDB connection is established (default)
- exit the process on the first failure and let the orchestrator handle restarts (the default in Docker/K8s environments)
In Docker/K8s we override these settings because restarting on failure and letting the orchestrators handle the container restarts is the better way for these systems to control the recovery, liveness, and availability: https://github.com/StackStorm/st2-dockerfiles/blob/618f6dcff4d84aa5c26f5dd719a33bf309dbbcbf/base/files/st2.tmp.conf#L17-L21
If you'd like to retry the connection until it's successful in an existing running process, you can revert the default st2 settings here: https://github.com/StackStorm/st2/blob/dfab25aab0468f330a635db55f01286523217135/conf/st2.conf.sample#L129-L136 and also fine-tune timeouts, number of retries, delays, and so on.
What are the benefits of a middle-man script that waits for MongoDB to be up & running if st2 can handle reconnection itself?
I do apologize. I need to remember, when uploading a lot of troubleshooting, to summarize it some.
- the easiest way to reproduce this is to spin up a clean VM of Ubuntu 20.04 or 22.04 and install the latest docker and docker-compose. It does seem that for those who have had this setup working, it continues to work through upgrades.
- mongo (and redis) are up before st2api - I bring those up first, and they are also dependencies of st2api in the docker-compose file.
- Alternate mongo clients, including one I wrote based on the code in st2common/models/db/__init__.py, connect to mongo with no issues.
- st2api (and all the others) fail to connect to mongo using the StackStorm code.
- It's not a DNS error or a closed-port error.
- I used the middle-man script to eliminate some potential race conditions, like Docker networking not being established. It waits for 30 seconds and then starts st2api. Even with that wait, st2api still fails to connect to mongo.
Per your suggestion, I went ahead and updated my retry configurations.
# Don't try to reconnect to database and exit early and allow k8s to handle reconnection's for it to keep retrying.
[database]
# Connection retry backoff max (seconds).
connection_retry_backoff_max_s = 12
# Backoff multiplier (seconds).
connection_retry_backoff_mul = 2
# Connection retry total time (minutes).
connection_retry_max_delay_m = 5
# Connection and server selection timeout (in ms).
connection_timeout = 6000
Running this has it retry multiple times, and the container does exit after a minute.
Tail end of the log:
st2api_1 | 2022-11-22 01:37:07,051 INFO [-] Using metrics driver: noop
st2api_1 | 2022-11-22 01:37:07,061 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1 | 2022-11-22 01:37:13,080 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 6.0s, Topology Description: <TopologyDescription id: 637c27c32cb168d993d30aa9, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2api_1 | 2022-11-22 01:37:13,081 ERROR [-] (PID=1) ST2 API quit due to exception.
st2api_1 | Traceback (most recent call last):
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/cmd/api.py", line 98, in main
st2api_1 | _setup()
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/cmd/api.py", line 60, in _setup
st2api_1 | common_setup(
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/service_setup.py", line 252, in setup
st2api_1 | db_setup()
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/database_setup.py", line 55, in db_setup
st2api_1 | connection = db_init.db_setup_with_retry(**db_cfg)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/persistence/db_init.py", line 79, in db_setup_with_retry
st2api_1 | return db_func_with_retry(
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/persistence/db_init.py", line 58, in db_func_with_retry
st2api_1 | return retrying_obj.call(db_func, *args, **kwargs)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 206, in call
st2api_1 | return attempt.get(self._wrap_exception)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 247, in get
st2api_1 | six.reraise(self.value[0], self.value[1], self.value[2])
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/six.py", line 696, in reraise
st2api_1 | raise value
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 200, in call
st2api_1 | attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 239, in db_setup
st2api_1 | connection = _db_connect(
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 213, in _db_connect
st2api_1 | raise e
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 204, in _db_connect
st2api_1 | connection.admin.command("ping")
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/database.py", line 736, in command
st2api_1 | with self.__client._socket_for_reads(
st2api_1 | File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
st2api_1 | return next(self.gen)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1325, in _socket_for_reads
st2api_1 | server = self._select_server(read_preference, session)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
st2api_1 | server = topology.select_server(server_selector)
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 241, in select_server
st2api_1 | return random.choice(self.select_servers(selector,
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 199, in select_servers
st2api_1 | server_descriptions = self._select_servers_loop(
st2api_1 | File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 215, in _select_servers_loop
st2api_1 | raise ServerSelectionTimeoutError(
st2api_1 | pymongo.errors.ServerSelectionTimeoutError: No servers found yet, Timeout: 6.0s, Topology Description: <TopologyDescription id: 637c27c32cb168d993d30aa9, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2-docker_st2api_1 exited with code 1
real 1m5.811s
user 0m0.514s
sys 0m0.099s
I ran into the same issue. Somewhere, buried ONLY in the first attempt to start st2 (and incidentally mongodb), was the following error:
mongo_1 | {"t":{"$date":"2022-11-22T03:56:33.766+00:00"},"s":"F", "c":"STORAGE", "id":4671205, "ctx":"initandlisten","msg":"This version of MongoDB is too recent to start up on the existing data files. Try MongoDB 4.2 or earlier."}
Clearing up my volumes to ensure I didn't have compatibility issues was one way to resolve this for me.
OR migrate your existing data from your current version to mongo:4.4.
You could try docker-compose up mongo and inspect around. De-couple the containers and try to launch them individually.
I noticed mongo was dying with exit code 14; each time it respawned it would print some of its log (but not all of it).
UPDATE:
In my case, other services had a problem connecting to mongo and failed with ServerSelectionTimeoutError, but mongo itself was crashing, and the version-compatibility mismatch message was only emitted at the very beginning.