contrib-helm-chart
Generic worker becomes stale
Hello,
I've stumbled upon an issue where generic workers (and possibly scheduled workers too) become stale at arbitrary intervals. By stale I mean they neither pick up new jobs nor process anything. The only workaround I've found so far is to kill the pods so that they get recreated, but I'm trying to automate this.
Has anyone had the same problem?
Chart Version: 3.0.0-beta1
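For anyone who needs to automate the "kill the pods" workaround in the meantime: a blunt but workable stopgap is a CronJob that runs kubectl rollout restart against the worker Deployments on a schedule. This is only a sketch, not something the chart ships: the deployment names, schedule and ServiceAccount below are assumptions and must match your release, and the ServiceAccount needs a Role/RoleBinding that allows it to get and patch Deployments in the namespace.

# Stopgap: periodically restart the worker Deployments so a stale worker
# never stays stuck for long. All names here are assumptions -- check
# `kubectl get deploy` in your namespace and adjust.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redash-worker-restart
spec:
  schedule: "0 */6 * * *"   # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          # Must be bound to a Role that allows get/patch on deployments.
          serviceAccountName: redash-worker-restart
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.27
              command:
                - /bin/sh
                - -c
                - >-
                  kubectl rollout restart
                  deployment/redash-adhocworker
                  deployment/redash-scheduledworker
                  deployment/redash-genericworker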
I believe I'm having the same issue - after a while I get "Unknown error occurred while performing connection test" for all queries, and it seems that the adhocworker gets stuck. Currently this is only a guess, because there are no indicative logs in any of the pods.
It would be good to confirm whether this is a chart-specific issue or a data source connection issue - I see some reports of this message (e.g. https://github.com/getredash/redash/issues/2047 & https://github.com/getredash/redash/issues/5664). If it's only a temporary data source connection issue I would expect the worker to continue once the connection is back, at least (not sure exactly how it works), but if that isn't the case I'm guessing it's an application bug. We could also look for ways to improve the health check to detect when this is happening and restart the worker, although we should probably open a ticket with the application as well (I guess it's conceivable this is somehow correct behavior).
@grugnog This happened for all datasources I've tried - Postgres and Prometheus. Connections worked again after restarting the workers.
@oedri if you are able to add any detail (debug logs, strace perhaps?) it would be great if you could open a ticket regarding this on https://github.com/getredash/redash - it seems unlikely to be a Kubernetes issue, except perhaps something environmental (resource exhaustion etc) which is not really in scope of this chart, although we could adjust the docs/defaults perhaps if we identify that as the cause. On the detection/recovery side, we have an existing open issue for that https://github.com/getredash/contrib-helm-chart/issues/72 so I think we can close this one.
Happened to me too.
Happening to me every day too.
@grugnog How can we enable debug logs or strace in the redash helm charts?
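In case it helps while gathering more detail: Redash reads its log level from the REDASH_LOG_LEVEL environment variable, so if the chart lets you inject extra environment variables you can bump the workers to DEBUG with a values override like the sketch below (whether the key is literally `env` is an assumption - confirm against the chart's values.yaml). strace is not bundled in the image, so for that you would need to exec into the pod and install it, or attach an ephemeral debug container.

# values.yaml override - sketch only; the `env` key is an assumption.
env:
  # Redash passes this to its Python loggers; the default is INFO.
  REDASH_LOG_LEVEL: "DEBUG"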
+1. There seems to be an issue respawning the process if it dies. A transient redis issue triggers a persistent problem for the worker. Digging through logs of workers, I can see the following:
April 15th 2023, 02:41:35.532 Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 550, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 606, in _connect
    raise err
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 594, in _connect
    sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./manage.py", line 9, in <module>
    manager()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/app/redash/cli/rq.py", line 49, in worker
    w.work()
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 511, in work
    self.register_birth()
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 273, in register_birth
    if self.connection.exists(self.key) and \
  File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 1581, in exists
    return self.execute_command('EXISTS', *names)
  File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 1182, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 554, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to redash-redis-master:6379. Connection timed out.
April 15th 2023, 02:41:35.986 2023-04-15 00:41:35,986 INFO exited: worker-0 (exit status 1; not expected)
April 15th 2023, 02:41:36.987 2023-04-15 00:41:36,987 INFO gave up: worker-0 entered FATAL state, too many start retries too quickly
April 15th 2023, 02:42:01.011 2023/04/15 00:42:01 [worker_healthcheck] Received TICK_60 event from supervisor
April 15th 2023, 02:42:01.013 RESULT 2
April 15th 2023, 02:42:01.013 2023/04/15 00:42:01 [worker_healthcheck] No processes in state RUNNING found for process worker
April 15th 2023, 02:42:01.013 OKREADY
The liveness check for workers PR should improve the situation.
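For reference, the kind of check that PR is presumably aiming at could look roughly like the sketch below. The key point from the log above is that supervisord stays up after worker-0 enters FATAL, so the container still looks healthy to Kubernetes; an exec liveness probe that fails when no worker process is running lets the kubelet restart the container instead of leaving it wedged. The probe placement in values, the process pattern and the timings here are assumptions, not the chart's actual API.

# Sketch of an exec liveness probe for a worker container. Whether the
# chart exposes `livenessProbe` as a value, and the exact process pattern
# to match, are assumptions - adjust to the chart/image version you run.
livenessProbe:
  exec:
    # pgrep exits non-zero when no process matches, which fails the probe
    # and makes the kubelet restart the container.
    command:
      - pgrep
      - -f
      - rq worker
  initialDelaySeconds: 120
  periodSeconds: 60
  failureThreshold: 3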