contrib-helm-chart
Generic worker becomes stale
Hello,
I've stumbled upon an issue where generic workers (and possibly scheduled workers too) become stale at arbitrary intervals. By stale I mean they neither pick up new jobs nor process anything. The only workaround I've found so far is to kill the pods so that they get recreated, but I'm trying to automate this.
Has anyone had the same problem?
Chart Version: 3.0.0-beta1
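For anyone who needs to automate the "kill the pods" workaround in the meantime: a blunt but workable stopgap is a CronJob that runs kubectl rollout restart against the worker Deployments on a schedule. This is only a sketch, not something the chart ships: the deployment names, schedule and ServiceAccount below are assumptions and must match your release, and the ServiceAccount needs a Role/RoleBinding that allows it to get and patch Deployments in the namespace.

# Stopgap: periodically restart the worker Deployments so a stale worker
# never stays stuck for long. All names here are assumptions -- check
# `kubectl get deploy` in your namespace and adjust.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redash-worker-restart
spec:
  schedule: "0 */6 * * *"   # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          # Must be bound to a Role that allows get/patch on deployments.
          serviceAccountName: redash-worker-restart
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.27
              command:
                - /bin/sh
                - -c
                - >-
                  kubectl rollout restart
                  deployment/redash-adhocworker
                  deployment/redash-scheduledworker
                  deployment/redash-genericworker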
I believe I'm having the same issue - after a while I get "Unknown error occurred while performing connection test" for all queries, and it seems that the adhocworker gets stuck. Currently this is only a guess, because there are no indicative logs in any of the pods.
It would be good to confirm whether this is a chart-specific issue or a data source connection issue - I see some reports of this message (e.g. https://github.com/getredash/redash/issues/2047 & https://github.com/getredash/redash/issues/5664). If it's only a temporary data source connection issue I would expect the worker to continue once the connection is back, at least (not sure exactly how it works), but if that isn't the case I'm guessing it's an application bug. We could also look for ways to improve the health check to detect when this is happening and restart the worker, although we should probably open a ticket with the application as well (I guess it's conceivable this is somehow correct behavior).
@grugnog This happened for all datasources I've tried - Postgres and Prometheus. Connections worked again after restarting the workers.
@oedri if you are able to add any detail (debug logs, strace perhaps?) it would be great if you could open a ticket regarding this on https://github.com/getredash/redash - it seems unlikely to be a Kubernetes issue, except perhaps something environmental (resource exhaustion etc) which is not really in scope of this chart, although we could adjust the docs/defaults perhaps if we identify that as the cause. On the detection/recovery side, we have an existing open issue for that https://github.com/getredash/contrib-helm-chart/issues/72 so I think we can close this one.
Happened to me too.
Happening to me every day too.
@grugnog How can we enable debug logs or strace in the redash helm charts?
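In case it helps while gathering more detail: Redash reads its log level from the REDASH_LOG_LEVEL environment variable, so if the chart lets you inject extra environment variables you can bump the workers to DEBUG with a values override like the sketch below (whether the key is literally `env` is an assumption - confirm against the chart's values.yaml). strace is not bundled in the image, so for that you would need to exec into the pod and install it, or attach an ephemeral debug container.

# values.yaml override - sketch only; the `env` key is an assumption.
env:
  # Redash passes this to its Python loggers; the default is INFO.
  REDASH_LOG_LEVEL: "DEBUG"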
+1. There seems to be an issue respawning the process if it dies. A transient redis issue triggers a persistent problem for the worker. Digging through logs of workers, I can see the following:
April 15th 2023, 02:41:35.532 Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 550, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 606, in _connect
    raise err
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 594, in _connect
    sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./manage.py", line 9, in <module>
    manager()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/app/redash/cli/rq.py", line 49, in worker
    w.work()
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 511, in work
    self.register_birth()
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 273, in register_birth
    if self.connection.exists(self.key) and \
  File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 1581, in exists
    return self.execute_command('EXISTS', *names)
  File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 1182, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 554, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to redash-redis-master:6379. Connection timed out.
April 15th 2023, 02:41:35.986 2023-04-15 00:41:35,986 INFO exited: worker-0 (exit status 1; not expected)
April 15th 2023, 02:41:36.987 2023-04-15 00:41:36,987 INFO gave up: worker-0 entered FATAL state, too many start retries too quickly
April 15th 2023, 02:42:01.011 2023/04/15 00:42:01 [worker_healthcheck] Received TICK_60 event from supervisor
April 15th 2023, 02:42:01.013 RESULT 2
April 15th 2023, 02:42:01.013 2023/04/15 00:42:01 [worker_healthcheck] No processes in state RUNNING found for process worker
April 15th 2023, 02:42:01.013 OKREADY
The liveness check for workers PR should improve the situation.
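For reference, the kind of check that PR is presumably aiming at could look roughly like the sketch below. The key point from the log above is that supervisord stays up after worker-0 enters FATAL, so the container still looks healthy to Kubernetes; an exec liveness probe that fails when no worker process is running lets the kubelet restart the container instead of leaving it wedged. The probe placement in values, the process pattern and the timings here are assumptions, not the chart's actual API.

# Sketch of an exec liveness probe for a worker container. Whether the
# chart exposes `livenessProbe` as a value, and the exact process pattern
# to match, are assumptions - adjust to the chart/image version you run.
livenessProbe:
  exec:
    # pgrep exits non-zero when no process matches, which fails the probe
    # and makes the kubelet restart the container.
    command:
      - pgrep
      - -f
      - rq worker
  initialDelaySeconds: 120
  periodSeconds: 60
  failureThreshold: 3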