Workers in FATAL state after redis connection drops out
Issue Summary
After a while the Redash ad-hoc workers stop processing jobs. I found these in the error logs:
2022-09-21 07:14:26,660 INFO gave up: worker-0 entered FATAL state, too many start retries too quickly
2022-09-21 07:14:26,661 INFO gave up: worker-1 entered FATAL state, too many start retries too quickly
So basically, if the worker loses network connectivity to Redis it will try to restart, but it looks like after 3 attempts it goes into a FATAL state, which means the worker containers are dead and need to be restarted. We run Redash on Kubernetes using AWS spot instances, so Redis may be down for short periods, and we don't want the workers to end up in a FATAL state that forces us to restart the containers every time.
Steps to Reproduce
- Scale the redis container down to 0.
- Wait about 30 seconds; the workers go into a FATAL state and stop processing queries.
- Scale redis back to 1.
- Redash still won't process any queries because the workers are in a FATAL state.
Technical details:
- Redash Version: 10.0.0
- Browser/OS: Chrome
- How did you install Redash: Kubernetes helm chart
Bump
We have the same problem on GCP.
I wonder if it is possible to have a livenessProbe on the workers that would terminate the Pod if it loses its connection to Redis. This is not ideal, but it would at least recover the system from a failed state and allow queries to be processed again.
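For what it's worth, here is a minimal sketch of what such a check could look like. This is not something Redash ships, and the script name is made up; it assumes supervisorctl is available inside the worker container and can reach the running supervisord. The script exits non-zero whenever supervisord reports a FATAL process, which an exec livenessProbe could use to restart the Pod.

#!/usr/bin/env python3
# Hypothetical liveness check for the Redash worker container (not part of
# Redash): exit non-zero if supervisord reports any process in FATAL state.
# Assumes supervisorctl is on the PATH and can reach the running supervisord.
import subprocess
import sys


def main() -> int:
    try:
        result = subprocess.run(
            ["supervisorctl", "status"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except Exception as exc:
        # supervisord itself unreachable -> treat the container as unhealthy.
        print(f"liveness check failed: {exc}", file=sys.stderr)
        return 1

    if "FATAL" in result.stdout:
        print("found a worker process in FATAL state", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())

Wired up as an exec livenessProbe on the worker container, the kubelet would then restart the Pod instead of leaving the workers dead.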
We have the same problem. We use external Redis instances with Sentinel and HAProxy in front of them, and it happens very often. I didn't find any additional Redis settings in Redash to tune this.
I am testing out a potential workaround for this problem by making supervisord keep retrying to start the worker process in the event of a failure. supervisord has a configuration parameter called startretries:
The number of serial failure attempts that supervisord will allow when attempting to start the program before giving up and putting the process into a FATAL state.
Setting this to a high enough value seems to allow the worker to recover in the event of a Redis communication failure. I tested this by taking Redis down for a couple of minutes and then starting it back up.
The supervisord config is in https://github.com/getredash/redash/blob/master/worker.conf. I mounted a custom worker.conf with startretries=999999 to make it retry effectively forever:
[program:worker]
command=./manage.py rq worker %(ENV_QUEUES)s
process_name=%(program_name)s-%(process_num)s
numprocs=%(ENV_WORKERS_COUNT)s
directory=/app
stopsignal=TERM
autostart=true
autorestart=true
; effectively retry forever
startretries=999999
...
This is not ideal but seems to do the job. I'd prefer to have a proper healthcheck endpoint or script you can call to have Kubernetes restart the Pod if the worker is in a FATAL state.
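Another possibility, sketched here purely as an idea rather than anything from this thread or from Redash itself: supervisord can run an event listener subscribed to PROCESS_STATE_FATAL events and start the failed program again, so the workers recover once Redis comes back without restarting the whole Pod. This assumes supervisord's RPC interface is enabled (a unix_http_server or inet_http_server section) and that the script, fatal_restarter.py (a made-up name), is registered via an [eventlistener:...] section with events=PROCESS_STATE_FATAL.

#!/usr/bin/env python3
# Hypothetical supervisord event listener (not part of Redash): whenever a
# program enters the FATAL state, wait a bit and start it again.
# Assumes supervisord exposes its RPC interface (unix_http_server or
# inet_http_server) and that this script is registered roughly as:
#   [eventlistener:fatal_restarter]
#   command=python3 /app/fatal_restarter.py
#   events=PROCESS_STATE_FATAL
import os
import sys
import time

from supervisor import childutils


def main():
    rpc = childutils.getRPCInterface(os.environ)
    while True:
        # Blocks until supervisord sends the next subscribed event.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers["eventname"] == "PROCESS_STATE_FATAL":
            pheaders = childutils.get_headers(payload)
            name = "%s:%s" % (pheaders["groupname"], pheaders["processname"])
            # Small delay so we don't spin if Redis is still unreachable.
            time.sleep(10)
            try:
                rpc.supervisor.startProcess(name)
                print("restarted %s" % name, file=sys.stderr)
            except Exception as exc:
                print("failed to restart %s: %s" % (name, exc), file=sys.stderr)
        # Acknowledge the event so supervisord keeps sending new ones.
        childutils.listener.ok(sys.stdout)


if __name__ == "__main__":
    main()

Compared to startretries=999999, this keeps the retry logic out of the [program:worker] section, but it is more moving parts and I have not battle-tested it.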