asynq icon indicating copy to clipboard operation
asynq copied to clipboard

[QUESTION] Could multiple worker instances cause task duplication in the queue due to the recovery mechanism?

Open WSUFan opened this issue 1 year ago • 4 comments

Hi @hibiken. I have a follow-up question regarding running multiple instances of workers that serve the same queue. As we have a recovery mechanism now, could there be a possibility that several recovery processes pick up and re-enqueue the same lease expired task multiple times (so the queue will have multiple entries of the same task ID)?

WSUFan avatar Aug 06 '23 05:08 WSUFan

@WSUFan thanks for pointing this out.

I believe you're right, this needs to be fixed. Approaches I can think of:

  1. take a lock on these redis keys (asynq:{}:lease) when running listLeaseExpired command (lock with TTL to avoid unfortunate situation)
  2. within listLeaseExpired, we could move these expired task ids to "stagin" area so that they are not in the asynq:{}:lease anymore

Any feedback is appreicated.

hibiken avatar Aug 06 '23 14:08 hibiken

yeah, maybe a redis lock is needed when operating with these keys. I would choose approach 1, haha

WSUFan avatar Aug 06 '23 17:08 WSUFan

While there is a race condition, is there an actual risk of task duplication?

Take 2 workers W1 and W2 and an expired task T1 and consider this timeline of sequential events

  • W1 calls ListLeaseExpired and gets T1
  • W2 calls ListLeaseExpired and gets T1
  • W1 calls retry It removes T1 from asynq:{<qname>}:active and asynq:{<qname>}:lease and moves it to asynq:{<qname>}:retry
  • W2 calls retry but it will fail because calling LREM on asynq:{<qname>}:active return 0 elements.

Now of course both W1 and W2 having the same expired tasks is not ideal and there is a race condition where in the last step above, W2 might LREM task T1 after it has been re-enqueued again.

So I think your proposal of a Redis lock around listLeaseExpired is good solution so each worker is dealing with disjoint set of expired tasks

yousifh avatar Aug 07 '23 14:08 yousifh

I believe another solution would be to have the client side manage this. We could execute the listLeaseExpired() function on the client side and retrieve the task initiated by this client. In this scenario, there would only be one retry (re-enqueuing the same task) from this client, assuming each task has a unique ID.

WSUFan avatar Aug 07 '23 16:08 WSUFan