Intermittent worker timeout
Hi everyone,
First of all, thank you for the hard work on arq, we love it!
Recently, we have been encountering intermittent TimeoutError from workers:
TimeoutError: null
File "arq", line 8, in <module>
sys.exit(cli())
File "click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "click/core.py", line 782, in main
rv = self.invoke(ctx)
File "click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "arq/cli.py", line 48, in cli
run_worker(worker_settings_, **kwargs)
File "arq/worker.py", line 630, in run_worker
worker.run()
File "arq/worker.py", line 230, in run
self.loop.run_until_complete(self.main_task)
File "asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "arq/worker.py", line 277, in main
await self._poll_iteration()
File "arq/worker.py", line 312, in _poll_iteration
t.result()
File "arq/worker.py", line 356, in run_job
self.pool.expire(retry_key_prefix + job_id, 88400),
File "aioredis/util.py", line 59, in wait_convert
result = await fut
File "aioredis/pool.py", line 257, in _wait_execute
conn = await self.acquire(command, args)
File "aioredis/pool.py", line 324, in acquire
await self._fill_free(override_min=True)
File "aioredis/pool.py", line 398, in _fill_free
conn = await self._create_new_connection(self._address)
File "aioredis/connection.py", line 113, in create_connection
timeout)
File "asyncio/tasks.py", line 449, in wait_for
raise futures.TimeoutError()
This is running on a Kubernetes cluster (EKS), using AWS ElastiCache as the Redis provider. Workers are started using the arq CLI.
We tried increasing the timeout to 10 seconds, but the same problem happens. This is very difficult to diagnose and investigate due to the intermittent nature of the issue.
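For context, here is roughly the shape of the worker settings module we point the arq CLI at; the function name and the values below are illustrative placeholders rather than our production config:

# settings.py -- started with `arq settings.WorkerSettings`
from arq.connections import RedisSettings

async def process_item(ctx, item_id):
    # placeholder job; the real workload is not shown here
    return item_id

class WorkerSettings:
    functions = [process_item]
    job_timeout = 300  # per-job timeout; 300s is arq's default
    redis_settings = RedisSettings(host='our-elasticache-endpoint', port=6379)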
Anyone encountered anything similar before?
Thanks!
@nicjac I was seeing either the same or a similar issue when running jobs that were blocking the event loop for longer than the job_timeout in worker settings. However, I'm not sure if that was the cause.
Interesting, thanks! I do not know whether those errors were for jobs running for more than the default worker timeout value (300 seconds AFAIK?), but I will run some tests to see whether this could be the cause.
@nicjac Are you running synchronous jobs? If so, Issue #230 and PR #232 might be of interest.
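If the jobs are synchronous, one pattern that keeps the worker's event loop responsive is to push the blocking call onto a thread executor. A rough sketch, where blocking_work and my_job are made-up names rather than anything from arq:

import asyncio

def blocking_work(n):
    # stand-in for a long synchronous computation or blocking I/O call
    return sum(i * i for i in range(n))

async def my_job(ctx, n):
    # run the blocking function in the default thread pool so the event loop
    # can keep servicing heartbeats and Redis polling while it runs
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_work, n)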
I’ve also been running into this same issue, and haven’t had much luck finding anything helpful online.
I played around with really long job timeouts (as well as increasing poll_delay and lowering max_jobs), but still seeing intermittent timeouts with the same trace logs 😔
Same, after playing with various settings, those timeouts still happen (albeit a bit less frequently now? Likely placebo).
If I can't figure this out I will have to consider an alternative, though I am worried that it will mean implementing a much more complex job system.
And it wouldn’t be too big a problem if the job was requeued - with this Timeout exception, the job that the worker was working on is lost in the void. @cmangla does this seem like a feasible feature?
According to arq documentation, "jobs aren’t removed from the queue until they’ve either succeeded or failed". So your worker should be able to restart that job if you restart the worker.
My guess is that this timeout exception isn't being handled appropriately, which is why we're seeing behaviour inconsistent with arq's documentation. Can you verify this @nicjac?
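As an aside, for errors raised inside the job itself (as opposed to this pool-level TimeoutError), arq does support requeueing by raising Retry. A minimal sketch following the documented pattern; flaky_job and the delays are illustrative:

from arq import Retry

async def flaky_job(ctx):
    # pretend a dependency is unavailable on the first two attempts;
    # ctx['job_try'] is the attempt counter arq puts in the context
    if ctx['job_try'] < 3:
        # requeue the job, deferring 5s then 10s before the next attempt
        raise Retry(defer=ctx['job_try'] * 5)
    return 'done'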
v0.20 has just been released, please try it and let us know if this issue is fixed.
The other thing to look at is the number of redis connections being used; I know aioredis has some ongoing problems with connections leaking from pools (can't find the issue right now).
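One quick way to watch for that is to poll Redis's INFO clients section and track connected_clients over time. A rough sketch using aioredis 1.x, with a placeholder address:

import asyncio
import aioredis

async def main():
    # connect to the same Redis instance the workers use (placeholder address)
    redis = await aioredis.create_redis('redis://localhost:6379')
    info = await redis.info('clients')
    print(info)  # watch connected_clients to spot leaking connections
    redis.close()
    await redis.wait_closed()

asyncio.run(main())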
I found the solution. This is because aioredis reaches the timeout when creating the connection pool, and the default timeout value set by arq is just 1 second, which is too short in some conditions.
The solution is to add the conn_timeout keyword parameter when declaring your Worker class. For example, increase conn_timeout to 300 seconds:
import arq.connections

class Worker:
    redis_settings = arq.connections.RedisSettings(database=0, conn_timeout=300)
Hope this helps.
Hi @2foil, I know it's been a while since this issue was last updated, but I was wondering how you settled on 300 for the conn_timeout?