rq
rq copied to clipboard
Discrepancies between documentation and code for `worker_ttl`
Originally posted in the discord group here, but wanted to repost as an issue in case any one stumbles on this as well.
https://discord.com/channels/844816706231861248/844816707126296578/1248726228677034144
There appears to be a discrepancy between how the documentation describes default_worker_ttl
and how it is actually used in code.
I stumbled on this discrepancy because I've been running into some difficult issues where jobs appear to be stuck, very similar to what is described in this open issue https://github.com/rq/rq/issues/758 and am hoping that adjusting the socket_timeout
will fix the issue (ultimately it didn't).
The leading suggestion in the above GH issue is to configure the socket_timeout
param. In the RQ docs it mentions that RQ automatically configures socket_timeout
to be 10s more than the default_worker_ttl
https://python-rq.org/docs/connections/#timeout
To avoid potential issues with hanging Redis commands, specifically the blocking BLPOP command, RQ automatically sets a socket_timeout value that is 10 seconds higher than the default_worker_ttl.
If you prefer to manually set the socket_timeout value, make sure that the value being set is higher than the default_worker_ttl (which is 420 by default).
However in the code it appears to be making the socket_timeout
actually 5 seconds shorter than the default_worker_ttl
, thus contradicting the documentation.
DEFAULT_WORKER_TTL = 420
_set_connection()
calls self.connection_timeout
https://github.com/rq/rq/blob/0e15b2a942b8d724e938b04c030161422942d1ea/rq/worker.py#L302
timeout_config = {"socket_timeout": self.connection_timeout}
https://github.com/rq/rq/blob/0e15b2a942b8d724e938b04c030161422942d1ea/rq/worker.py#L781-L782
which is
def connection_timeout(self) -> int:
return self.dequeue_timeout + 10
which calls
def dequeue_timeout(self) -> int:
return max(1, self.worker_ttl - 15)
Therefore, connection_timeout = 420 - 15 + 10
=> 415 seconds
While the documentation mentions it needs to be 10 seconds higher than the default_worker_timeout
, the code ensures that it's 10 seconds higher than the dequeue_timeout which is an important distinction.
It appears that this code change is intentional according to this PR: https://github.com/rq/rq/pull/1794
The connection timeout is a safe guard: since there were reports of workers not reading from the queue, and there were reports of the hanging BLOP when used with Redis Py, we added the socket_timeout to the connection, so if the BLOP hangs, the socket timeout would raise a TimeoutError.
It would be great if the docs were updated to reflect the current state of the timeouts in the code.
Also I think it would be helpful to describe how the various timeouts are actually used. For example, what even is a worker_timeout
? The best I can deduce is that it's a timeout only on the worker key in redis, but doesn't necessarily affect the worker process.
I hope this is helpful, let me know if I misunderstand anything or if there is anything I can do to help.
Thanks!