rq icon indicating copy to clipboard operation
rq copied to clipboard

Discrepancies between documentation and code for `worker_ttl`

Open stv8 opened this issue 8 months ago • 5 comments

Originally posted in the discord group here, but wanted to repost as an issue in case any one stumbles on this as well.

https://discord.com/channels/844816706231861248/844816707126296578/1248726228677034144


There appears to be a discrepancy between how the documentation describes default_worker_ttl and how it is actually used in code.

I stumbled on this discrepancy because I've been running into some difficult issues where jobs appear to be stuck, very similar to what is described in this open issue https://github.com/rq/rq/issues/758 and am hoping that adjusting the socket_timeout will fix the issue (ultimately it didn't).

The leading suggestion in the above GH issue is to configure the socket_timeout param. In the RQ docs it mentions that RQ automatically configures socket_timeout to be 10s more than the default_worker_ttl

https://python-rq.org/docs/connections/#timeout

To avoid potential issues with hanging Redis commands, specifically the blocking BLPOP command, RQ automatically sets a socket_timeout value that is 10 seconds higher than the default_worker_ttl.

If you prefer to manually set the socket_timeout value, make sure that the value being set is higher than the default_worker_ttl (which is 420 by default).

However in the code it appears to be making the socket_timeout actually 5 seconds shorter than the default_worker_ttl, thus contradicting the documentation.

DEFAULT_WORKER_TTL = 420

_set_connection() calls self.connection_timeout

https://github.com/rq/rq/blob/0e15b2a942b8d724e938b04c030161422942d1ea/rq/worker.py#L302

timeout_config = {"socket_timeout": self.connection_timeout}

https://github.com/rq/rq/blob/0e15b2a942b8d724e938b04c030161422942d1ea/rq/worker.py#L781-L782

which is

    def connection_timeout(self) -> int:
        return self.dequeue_timeout + 10

which calls

    def dequeue_timeout(self) -> int:
        return max(1, self.worker_ttl - 15)

Therefore, connection_timeout = 420 - 15 + 10 => 415 seconds

While the documentation mentions it needs to be 10 seconds higher than the default_worker_timeout, the code ensures that it's 10 seconds higher than the dequeue_timeout which is an important distinction.

It appears that this code change is intentional according to this PR: https://github.com/rq/rq/pull/1794

The connection timeout is a safe guard: since there were reports of workers not reading from the queue, and there were reports of the hanging BLOP when used with Redis Py, we added the socket_timeout to the connection, so if the BLOP hangs, the socket timeout would raise a TimeoutError.

It would be great if the docs were updated to reflect the current state of the timeouts in the code.

Also I think it would be helpful to describe how the various timeouts are actually used. For example, what even is a worker_timeout? The best I can deduce is that it's a timeout only on the worker key in redis, but doesn't necessarily affect the worker process.

I hope this is helpful, let me know if I misunderstand anything or if there is anything I can do to help.

Thanks!

stv8 avatar Jun 11 '24 17:06 stv8