django-redis-cache icon indicating copy to clipboard operation
django-redis-cache copied to clipboard

get_or_set function not parallel safe

Open ioparaskev opened this issue 6 years ago • 1 comments

Issue: I've just discovered (the bad way) that get_or_set function is not multiprocess safe. There seems to be a race condition here

How to reproduce: Scenario to reproduce (obviously this will require some runs since parallelism tests are fuzzy): Use a get_or_set function call to to get a specific value. It seems that because get_or_set checks for a specific non-unique lock key, there are cases that after realizing that the requested key is not set (and before caling the callback func), when it checks for the lock key it is already set (because parallelism) and as a result returns the original None value not calling the callback function.

The above are clear by just reading the code, though I thought some explanation would do no harm

Possible solutions:

  1. Remove the lock key. I don't really understand why it is there in the first place. My thoughts are that it could as well skip the locking and override the key with the new callback call
  2. If lock key exists, try to fetch and return the key for the second time. This can be trickier than it looks since the original callback function might take some time to execute resulting in returning again None (if we don't wait for the key to be populated)
  3. If lock key exists wait until the requested key is populated and return its' value

Solutions 1 and 2 are easy, though I can imagine scenarios where solution 3 would be more preferable (if for example the callback function is expensive in processing power or memory footprint). So strongly suggest solution 3

ioparaskev avatar Apr 04 '18 14:04 ioparaskev

@ioparaskev: The lock is there to prevent multiple functions running the callback at the same time, to prevent the "thundering herd" problem (BTW - those docs are hard to find @sebleier - buried after a changelog in the README, and not in the API docs. It would make more sense to have the changelog at the bottom of the README in my opinion, or just move those usage docs and notes into the other docs).

As I see it, the problem is in the acquired == False situation, the code simply returns value, even if value is None - that is, for the initial case of the cache key not existing, and the race condition described, some calls can return None.

I think the best solution here is to have this pattern:

    if acquired:
        ...
    else if value is None:
        return func()

    return value

In this case we will not have full thundering herd protection, but the cache key should still get populated (by the other process which is executing func and has the lock). The alternative would be waiting around for the other process to finish and then using its value, but that could be pretty fragile - what if it doesn't finish for some reason? And how long do we wait? We could end up waiting longer than it would take to run func.

It would be better to definitely get the correct return value, even if we are duplicating some work, than to have fragility here.

spookylukey avatar Aug 21 '19 07:08 spookylukey