
Workers in FATAL state after redis connection drops out

Open · pjaak opened this issue 1 year ago · 3 comments

Issue Summary

After a while the Redash ad-hoc workers stop processing jobs. I found these in the error logs:

    2022-09-21 07:14:26,660 INFO gave up: worker-0 entered FATAL state, too many start retries too quickly
    2022-09-21 07:14:26,661 INFO gave up: worker-1 entered FATAL state, too many start retries too quickly

So basically, if a worker loses network connectivity to Redis it tries to restart, but after three failed attempts (supervisord's startretries defaults to 3) it goes into a FATAL state, which means the worker containers are dead and need to be restarted. We are trying to run Redash on Kubernetes using AWS spot instances, so there may be times when Redis is down for a short period, and we don't want the workers to go into a FATAL state that forces us to restart the containers each time.

Steps to Reproduce

  1. Scale the redis container down to 0.
  2. Wait about 30 seconds; the containers will go into a FATAL state and stop processing queries.
  3. Scale redis back to 1.
  4. Redash still won't process any queries because the workers are in a FATAL state.

Technical details:

  • Redash Version: 10.0.0
  • Browser/OS: Chrome
  • How did you install Redash: Kubernetes helm chart

pjaak avatar Sep 21 '22 07:09 pjaak

Bump

pjaak avatar Sep 29 '22 03:09 pjaak

We have the same problem on GCP. I wonder if it is possible to have a livenessProbe on the workers that would terminate the Pod if it lost connection to Redis. This is not ideal but would at least recover the system from a failed state and allow queries to be processed again.
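
Something like the following is roughly what I have in mind. This is an untested sketch: the timings are arbitrary, and it assumes supervisorctl inside the worker container can actually talk to the running supervisord (which may require the unix socket / RPC sections to be enabled in the supervisord config):

    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          # fail the probe when supervisord reports any process in FATAL state
          - "! supervisorctl status | grep -q FATAL"
      initialDelaySeconds: 60
      periodSeconds: 30
      failureThreshold: 3

A real probe would also want to fail when supervisorctl itself cannot reach supervisord, which the grep above does not catch.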

acondrat avatar Oct 26 '22 13:10 acondrat

We have the same problem.

We use external Redis instances with Sentinel, and HAProxy in front of them. It happens very often.

I didn't find any additional Redis settings in Redash.

Bralva avatar Oct 27 '22 09:10 Bralva

I am testing a potential workaround for this problem: making supervisord keep retrying to start the worker process when it fails. supervisord has a configuration parameter called startretries:

The number of serial failure attempts that supervisord will allow when attempting to start the program before giving up and putting the process into the FATAL state.

Setting this to a high enough value seems to allow the worker to recover in the event of a Redis communication failure. I tested this by taking Redis down for a couple of minutes and then bringing it back up.

The supervisord config is in https://github.com/getredash/redash/blob/master/worker.conf. I mounted a custom worker.conf with startretries=999999 to make it retry effectively forever.

    [program:worker]
    command=./manage.py rq worker %(ENV_QUEUES)s
    process_name=%(program_name)s-%(process_num)s
    numprocs=%(ENV_WORKERS_COUNT)s
    directory=/app
    stopsignal=TERM
    autostart=true
    autorestart=true
    ; retry (effectively) forever
    startretries=999999
    ...
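
For reference, mounting the custom worker.conf on Kubernetes could look roughly like the sketch below. The ConfigMap name and especially the mountPath are assumptions and need to match wherever the image actually reads worker.conf from; any remaining sections of the upstream file also need to be carried over:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: redash-worker-conf
    data:
      worker.conf: |
        [program:worker]
        command=./manage.py rq worker %(ENV_QUEUES)s
        process_name=%(program_name)s-%(process_num)s
        numprocs=%(ENV_WORKERS_COUNT)s
        directory=/app
        stopsignal=TERM
        autostart=true
        autorestart=true
        ; retry (effectively) forever
        startretries=999999
        ; ... any remaining sections from the upstream worker.conf go here ...

and, in the worker Deployment's Pod spec (fragment):

    volumes:
      - name: worker-conf
        configMap:
          name: redash-worker-conf
    containers:
      - name: worker
        volumeMounts:
          - name: worker-conf
            mountPath: /app/worker.conf   # assumed path; adjust to match the image
            subPath: worker.conf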

This is not ideal but seems to do the job. I'd prefer to have a proper healthcheck endpoint or script you can call to have Kubernetes restart the Pod if the worker is in a FATAL state.

acondrat avatar Dec 06 '22 17:12 acondrat