
Intermittent worker timeout

Open nicjac opened this issue 4 years ago • 11 comments

Hi everyone,

First of all thank you for the hard work on arq, we love it!

Recently, we have been encountering intermittent TimeoutError from workers:

TimeoutError: null
  File "arq", line 8, in <module>
    sys.exit(cli())
  File "click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "arq/cli.py", line 48, in cli
    run_worker(worker_settings_, **kwargs)
  File "arq/worker.py", line 630, in run_worker
    worker.run()
  File "arq/worker.py", line 230, in run
    self.loop.run_until_complete(self.main_task)
  File "asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "arq/worker.py", line 277, in main
    await self._poll_iteration()
  File "arq/worker.py", line 312, in _poll_iteration
    t.result()
  File "arq/worker.py", line 356, in run_job
    self.pool.expire(retry_key_prefix + job_id, 88400),
  File "aioredis/util.py", line 59, in wait_convert
    result = await fut
  File "aioredis/pool.py", line 257, in _wait_execute
    conn = await self.acquire(command, args)
  File "aioredis/pool.py", line 324, in acquire
    await self._fill_free(override_min=True)
  File "aioredis/pool.py", line 398, in _fill_free
    conn = await self._create_new_connection(self._address)
  File "aioredis/connection.py", line 113, in create_connection
    timeout)
  File "asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()

This is running on a Kubernetes cluster (EKS) and uses AWS ElastiCache as the Redis provider. Workers are started using the arq CLI.

We tried increasing the timeout to 10 seconds, but the same problem happens. This is very difficult to diagnose and investigate due to the intermittent nature of the issue.

Anyone encountered anything similar before?

Thanks!

nicjac avatar Feb 17 '21 10:02 nicjac

@nicjac I was seeing either the same or a similar issue when running jobs that were blocking the event loop for longer than the job_timeout in worker settings. However, I'm not sure if that was the cause.
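
A minimal sketch of what I mean (job names are made up, purely illustrative): the first coroutine never yields control back to the loop, so while it runs the worker can't poll Redis or keep its connections healthy, and 600 seconds also exceeds the default 300-second job_timeout.

import asyncio
import time

async def blocking_job(ctx):
    # time.sleep() is synchronous: the event loop is starved for the whole
    # 600 seconds, so nothing else in the worker gets scheduled.
    time.sleep(600)

async def yielding_job(ctx):
    # asyncio.sleep() hands control back to the loop, so polling and Redis
    # traffic continue while the job waits.
    await asyncio.sleep(600)

class WorkerSettings:
    functions = [blocking_job, yielding_job]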

cmangla avatar Mar 03 '21 11:03 cmangla

@nicjac I was seeing either the same or a similar issue when running jobs that were blocking the event loop for longer than the job_timeout in worker settings. However, I'm not sure if that was the cause.

Interesting, thanks! I do not know whether those errors were for jobs running for more than the default worker timeout value (300 seconds AFAIK?), but I will run some tests to see whether this could be the cause.

nicjac avatar Mar 03 '21 16:03 nicjac

@nicjac Are you running synchronous jobs? If so, Issue #230 and PR #232 might be of interest.
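
For what it's worth, a rough sketch (my own, not code from that PR; heavy_sync_work is a made-up name) of pushing a synchronous function onto a thread so the coroutine only awaits and the event loop stays free:

import asyncio
import functools

def heavy_sync_work(n):
    # an ordinary synchronous function, e.g. CPU- or library-bound code
    total = 0
    for i in range(n):
        total += i * i
    return total

async def wrapped_job(ctx, n=10_000_000):
    loop = asyncio.get_event_loop()
    # run the blocking call in the default thread pool so the worker's
    # event loop keeps polling Redis while the work happens
    return await loop.run_in_executor(None, functools.partial(heavy_sync_work, n))

class WorkerSettings:
    functions = [wrapped_job]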

cmangla avatar Mar 04 '21 10:03 cmangla

I’ve also been running into this same issue, and haven’t had much luck finding anything helpful online.

I played around with really long job timeouts (as well as increasing poll_delay and lowering max_jobs), but I'm still seeing intermittent timeouts with the same trace logs 😔
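
For reference, those knobs all live on the worker settings class; something along these lines is what I've been experimenting with (values and host name are purely illustrative):

from arq.connections import RedisSettings

class WorkerSettings:
    functions = []  # your job coroutines go here
    redis_settings = RedisSettings(host='my-redis-host')  # hypothetical host
    job_timeout = 3600   # seconds; generous limit for long-running jobs (default 300)
    poll_delay = 2.0     # seconds between queue polls (default 0.5)
    max_jobs = 5         # fewer concurrent jobs per worker (default 10)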

fredp4t1qbit avatar Mar 08 '21 18:03 fredp4t1qbit

Same, after playing with various settings, those timeouts still happen (albeit a bit less frequently now? Likely placebo).

If I can't figure this out I will have to consider an alternative, though I am worried that would mean implementing a much more complex job system.

nicjac avatar Mar 08 '21 20:03 nicjac

And it wouldn’t be too big a problem if the job was requeued - with this Timeout exception, the job that the worker was working on is lost in the void. @cmangla does this seem like a feasible feature?

fredp4t1qbit avatar Mar 09 '21 00:03 fredp4t1qbit

And it wouldn’t be too big a problem if the job was requeued - with this Timeout exception, the job that the worker was working on is lost in the void.

According to arq documentation, "jobs aren’t removed from the queue until they’ve either succeeded or failed". So your worker should be able to restart that job if you restart the worker.
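
If it helps with debugging, you can enqueue with a known job id and check whether the job is still sitting in the queue after one of these failures. A quick sketch (the function name and id are made up):

import asyncio
from arq import create_pool
from arq.connections import RedisSettings
from arq.jobs import Job

async def check():
    redis = await create_pool(RedisSettings())
    # enqueue with an explicit id so the job can be looked up later
    await redis.enqueue_job('download_report', _job_id='report-42')
    job = Job('report-42', redis)
    print(await job.status())  # queued / in_progress / complete / not_found

asyncio.run(check())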

cmangla avatar Mar 09 '21 10:03 cmangla

And it wouldn’t be too big a problem if the job was requeued - with this Timeout exception, the job that the worker was working on is lost in the void.

According to arq documentation, "jobs aren’t removed from the queue until they’ve either succeeded or failed". So your worker should be able to restart that job if you restart the worker.

My guess is that this timeout exception isn't being handled appropriately, which is why we're seeing behaviour inconsistent with arq's documentation. Can you verify this @nicjac?

fredp4t1qbit avatar Mar 09 '21 19:03 fredp4t1qbit

v0.20 has just been released, please try it and let us know if this issue is fixed.

The other thing to look at is the number of Redis connections being used; I know aioredis has some ongoing problems with connections leaking from pools (can't find the issue right now).
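
A quick way to keep an eye on that while the workers run (a sketch only; the exact shape of the INFO reply depends on your aioredis version):

import asyncio
from arq import create_pool
from arq.connections import RedisSettings

async def watch_connections():
    redis = await create_pool(RedisSettings())
    while True:
        # the 'clients' section of INFO includes connected_clients
        info = await redis.info('clients')
        print(info)
        await asyncio.sleep(5)

asyncio.run(watch_connections())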

samuelcolvin avatar Apr 26 '21 12:04 samuelcolvin

v0.20 has just been released, please try it and let us know if this issue is fixed.

The other thing to look at is the number of Redis connections being used; I know aioredis has some ongoing problems with connections leaking from pools (can't find the issue right now).

I found the solution. This happens because aioredis hits its timeout when creating the connection pool, and the default timeout value set by arq is just 1 second, which is too short in some conditions. The fix is to pass the conn_timeout keyword argument when declaring your worker class, for example increasing conn_timeout to 300 seconds:

import arq.connections

class Worker:
    redis_settings = arq.connections.RedisSettings(database=0, conn_timeout=300)

Hope this helps.

2foil avatar May 21 '21 03:05 2foil

Hi @2foil, I know it's been a while since this issue was last updated, but I was wondering how you settled on 300 for the conn_timeout?

lolknicksT-T avatar Jul 26 '22 13:07 lolknicksT-T