contrib-helm-chart

readiness + liveness for worker

Open dibi-codes opened this issue 4 years ago • 6 comments

The adhocworker and scheduledworker deployments are missing proper readiness and/or liveness checks. This leads to unresponsive containers when Redis connections get closed.

[2021-01-27 14:22:04,485][PID:6][WARNING][MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1052, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 348, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 679, in _receive
    ret.append(self._receive_one(c))
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 690, in _receive_one
    response = c.parse_response()
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 3036, in parse_response
    return self._execute(connection, connection.read_response)
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 3013, in _execute
    return command(*args)
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 637, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 290, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 224, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 199, in _read_from_socket
    (e.args,))
ConnectionError: Error while reading from socket: (u'Connection closed by server.',)

I think the best option would be to crash the containers whenever these errors occur; otherwise, the liveness check should report an error.

dibi-codes · Jan 27 '21 18:01

@dabeck good catch - were you able to identify why the connection was getting closed? @arikfr do you know offhand if there is an easy way to monitor worker health? Kubernetes supports either an HTTP endpoint (i.e. 200 status = OK) or a command (i.e. exit 0 = OK). Alternatively, perhaps a "crash on error" configuration setting could make services (both workers and even the main server) crash if they encounter any kind of error that is not recoverable (perhaps after a few retries)?
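
For illustration, a command-based liveness probe on the worker containers could look roughly like the sketch below. The celery invocation, the app module redash.worker, the node name, and the timings are assumptions on my side and not tested against this chart. Since celery inspect ping only succeeds when the worker replies over the broker, a worker that is alive but stuck without a Redis connection should eventually fail the probe and get restarted (it will also fail while Redis itself is unreachable).

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # ping only this worker node; non-zero exit if no reply arrives in time
      - 'celery -A redash.worker inspect ping -d "celery@$HOSTNAME" --timeout 10'
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 15
  failureThreshold: 3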

grugnog · Jan 27 '21 23:01

@grugnog in a clustered environment we must always expect such failures. In my case the Redis instance was shifted to another node and Redash lost the connection without properly re-establishing it. The log quoted in my initial comment was the last message from the workers.

dibi-codes · Jan 28 '21 08:01

@dabeck are you still having this issue? Did you come up with any workaround?

grugnog · Nov 20 '21 00:11

@grugnog I am no longer responsible for this at the moment. I'll let you know once I've had a chance to check it in the coming days/weeks.

dibi-codes · Nov 24 '21 12:01

You could add pgrep or ps to check for the running process inside the pods.
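
As a rough sketch, that could be wired in as an exec probe like the one below, assuming procps (pgrep) is available in the worker image and the process command line contains "celery worker". Note that this only proves the process exists; it would not detect a worker that is running but has lost its Redis connection.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # the [c]elery bracket trick keeps the probe's own shell from matching
      - 'pgrep -f "[c]elery worker" > /dev/null'
  periodSeconds: 30
  timeoutSeconds: 5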

pparthesh · Jan 03 '22 13:01

pgrep and ps are not available inside the worker's container.
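
If only pgrep/ps are missing, a variant that scans /proc with plain grep could stand in. This is just a sketch with the same limitation as above: it checks that a celery process exists, not that its broker connection is healthy.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # -q/-s keep the exit status clean; [c]elery avoids matching this command itself
      - 'grep -qs "[c]elery" /proc/[0-9]*/cmdline'
  periodSeconds: 30
  timeoutSeconds: 5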

shubhwip · Jan 16 '23 10:01