contrib-helm-chart
readiness + liveness for worker
The adhocworker and scheduledworker deployments are missing proper readiness and/or liveness checks. This leads to unresponsive containers when Redis connections get closed.
[2021-01-27 14:22:04,485][PID:6][WARNING][MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1052, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 348, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 679, in _receive
    ret.append(self._receive_one(c))
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 690, in _receive_one
    response = c.parse_response()
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 3036, in parse_response
    return self._execute(connection, connection.read_response)
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 3013, in _execute
    return command(*args)
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 637, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 290, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 224, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 199, in _read_from_socket
    (e.args,))
ConnectionError: Error while reading from socket: (u'Connection closed by server.',)
I think the best approach would be to crash the containers whenever these errors occur; otherwise, the liveness check should report an error.
@dabeck good catch - were you able to identify why the connection was getting closed? @arikfr do you know offhand if there is an easy way to monitor worker health? Kubernetes can support either a http endpoint (i.e. 200 status = OK) or a command (i.e. exit 0 = OK). Alternatively perhaps a "crash on error" configuration setting that would cause services (both workers and main server even) to crash if they encounter any kind of error that is not recovering (perhaps after a few retries)?
@grugnog in a clustered environment we must always expect such failures. In my case the Redis instance was shifted to another node and Redash lost the connection without ever re-establishing it. The log quoted in my initial comment was the last message from the workers.
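Worth noting: a plain broker reachability check would not catch this failure mode, because once Redis comes back on the new node, pinging it succeeds even though the worker's own consumer connection is still dead. A stdlib-only sketch of such a reachability check (the host/port are hypothetical and would have to point at the chart's Redis service), shown mainly to illustrate why it is insufficient on its own:

```python
import socket


def broker_reachable(host, port, timeout=2.0):
    """Send a raw RESP PING to Redis and check for +PONG.

    Stdlib-only so it works even in slim worker images.  This only
    proves Redis answers, not that the worker's connection recovered.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(16).startswith(b"+PONG")
    except OSError:  # refused, timed out, DNS failure, reset, ...
        return False


# A probe wrapper would exit non-zero on failure, e.g.:
#   raise SystemExit(0 if broker_reachable("redis", 6379) else 1)
```

For the hang described above, a check that interrogates the worker itself (e.g. Celery's remote-control ping) is needed rather than, or in addition to, this.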
@dabeck are you still having this issue? Did you come up with any workaround?
@grugnog I am currently no longer responsible for this issue. I'll let you know once I've checked it in the coming days/weeks.
Add pgrep or ps to check the running process inside the pods.
pgrep and ps are not available inside the worker's container.
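When the image ships neither pgrep nor ps, the same check can be done with a shell loop over /proc. A sketch, assuming a Linux container with /proc mounted and that matching the process name in the command line is a good-enough signal (it detects a dead process, but not a hung one):

```shell
# proc_running NAME: succeed (exit 0) if any process's command line
# contains NAME.  A stand-in for `pgrep NAME` on minimal images.
proc_running() {
  for cmdline in /proc/[0-9]*/cmdline; do
    [ -r "$cmdline" ] || continue
    # cmdline is NUL-separated; turn NULs into spaces before matching
    if tr '\0' ' ' < "$cmdline" 2>/dev/null | grep -q "$1"; then
      return 0
    fi
  done
  return 1
}

# usage as a liveness probe command:
#   proc_running celery || exit 1
```

Since this only proves the process exists, it would not have caught the hang in the traceback above; it is best combined with a check that exercises the worker, such as Celery's remote-control ping.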