django-cache-memoize

"Dog-piling" prevention support

lamby opened this issue 6 years ago • 18 comments

Under very heavy load, when the cached version of a given page expires or is missing, there may be sufficient concurrency that multiple workers will all attempt to evaluate the target function simultaneously. This could be "problematic".

Not sure about the best solution here, perhaps some very loose coupling with various locking mechanisms via a path.to.lock.method?

lamby avatar Nov 04 '17 09:11 lamby

How do you think that could be solved? If you run something like uwsgi, you could use a global variable, which I think is best solved with https://docs.python.org/3/library/functools.html#functools.lru_cache
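
For example, a minimal per-process memoization with functools.lru_cache could look like this (a sketch; it only helps within a single Python process):

from functools import lru_cache


@lru_cache(maxsize=None)
def expensive():
    # Computed at most once per Python process; other processes/machines still miss.
    return 999999 ** 9999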

peterbe avatar Aug 10 '18 12:08 peterbe

Interesting. I guess the locking should really be cross "instance".. I mean, if you have a single computer/worker then a global variable would work (despite being naturally icky..) but if you have a large number of machines/workers this would still result in one's database being hammered.

Oh, note that the naive solution of pushing it off to other projects, i.e.:

@cache_memoize(100)
def expensive():
    with redis.Lock('key', …):
        return 999999 ** 999999999

.. won't actually DTRT :)

lamby avatar Aug 10 '18 15:08 lamby

The whole point of django-cache-memoize is to use a central memoization. It already does that.

What I think you're talking about is potentially locking the function for other callers whilst processing the first caller. I.e. you could do something like this:

import time

from django.core.cache import cache  # whichever Django cache backend is configured


def memoize_decorator(function):
    def inner(*args, **kwargs):
        cache_key = make_cache_key(*args, **kwargs)  # some cache-key builder
        # Wait while another caller is busy computing the value
        while cache.get(cache_key + ':processing'):
            time.sleep(1)
        result = cache.get(cache_key)
        if result is None:
            # Mark the key as "being processed" for up to 60 seconds
            cache.set(cache_key + ':processing', True, 60)
            try:
                result = function(*args, **kwargs)
            finally:
                cache.delete(cache_key + ':processing')
            cache.set(cache_key, result, TTL)  # TTL: the desired expiry in seconds
        return result
    return inner

Was that what you had in mind?

peterbe avatar Aug 10 '18 18:08 peterbe

> is to use a central memoization

Indeed, which is (usually) implied to be global by one's Django cache backend. However, the Django cache primitives unfortunately do not provide a locking API call. :)

> you could do something like this:

Something like that. But, alas: the time.sleep(1) is rather sub-optimal given you are trying to increase throughput (!); there is a race condition between the cache.get(cache_key) and the cache.set(cache_key + ':processing', True, 60) (two callers can both see a cold cache before either has set the ':processing' key); and it is somewhat problematic if function takes more than 60 seconds to execute :)

lamby avatar Aug 10 '18 20:08 lamby

It's not easy, especially when you're distributed across Python processes, but the only way it can be solved is with a central cache.

One possible solution is to be aware of how many Python nodes you have and change a setting accordingly. If you know you only have 1 node (e.g. 1 EC2 instance) you can use Python's builtin lock, and if you know you have multiple, you opt for the use of a global cache. But still, even with that there is a race-condition risk if two independent, near-simultaneous cache-as-a-lock requests come in at the same time.
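
For the single-node case, a minimal sketch of what "use Python's builtin lock" could look like (wrapping the memoized call in a process-local threading.Lock; threads in other processes or on other machines are not covered):

import threading

from cache_memoize import cache_memoize

_lock = threading.Lock()


@cache_memoize(100)
def expensive():
    return 999999 ** 9999


def expensive_serialised():
    # Only one thread in this process runs the memoized call at a time.
    with _lock:
        return expensive()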

peterbe avatar Aug 13 '18 13:08 peterbe

> the only way it can be solved is with a central cache.

Hm, I think we are agreeing on a topic that is orthogonal to this issue. :) I guess we need a redis.Lock that's a "real" semaphore.. ;)

lamby avatar Aug 13 '18 13:08 lamby

That's interesting actually. I always/only use Redis for my Django caching so we could make it configurable. I.e. if you know you have Redis and you worry about executing a function (on cold caches) simultaneously you enable this. Wanna take a stab?

peterbe avatar Aug 13 '18 15:08 peterbe

I won't be able to commit bandwidth to this soon, but I think it would be fair to simply assume Redis when one wants this dog-piling prevention. Do note that — as mentioned in https://github.com/peterbe/django-cache-memoize/issues/5#issuecomment-412115632 — the naive solution does not do the right thing: a correct one should "recheck" that the cache is warm after waiting on the lock, but we don't want to invert the logic and serialise checking of the cache in the common/initial case!
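
A rough sketch of that shape (check the cache first; only on a miss acquire a lock and check again) might look like the following, assuming django-redis so that cache.lock() is available:

import hashlib

from django.core.cache import cache  # assumed to be the django-redis backend, which provides cache.lock()

TTL = 100  # seconds, purely for illustration


def dogpile_safe_memoize(function):
    def inner(*args, **kwargs):
        raw_key = f"{function.__name__}:{args!r}:{sorted(kwargs.items())!r}"
        cache_key = hashlib.md5(raw_key.encode("utf-8")).hexdigest()
        result = cache.get(cache_key)  # common case: plain, parallel cache read, no lock taken
        if result is None:
            with cache.lock(cache_key + ":lock", timeout=60):
                # Re-check: another caller may have populated the cache while we waited.
                result = cache.get(cache_key)
                if result is None:
                    result = function(*args, **kwargs)
                    cache.set(cache_key, result, TTL)
        return result
    return inner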

lamby avatar Aug 13 '18 15:08 lamby

Turns out, I needed this. And since I use Redis for my Django cache backend, I could use cache.lock(). See this blog post: https://www.peterbe.com/plog/django-lock-decorator-with-django-redis
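
Roughly, such a lock decorator could look like this (an approximation, not necessarily the exact code from the blog post), assuming django-redis:

import hashlib
from functools import wraps

from django.core.cache import cache  # django-redis exposes cache.lock()


def lock_decorator(key_maker=None):
    """Serialise execution of the decorated function across processes and web heads."""
    def decorator(func):
        @wraps(func)
        def inner(*args, **kwargs):
            key = key_maker(*args, **kwargs) if key_maker else f"{args!r}{kwargs!r}"
            lock_key = "lock:" + hashlib.md5(key.encode("utf-8")).hexdigest()
            with cache.lock(lock_key):
                return func(*args, **kwargs)
        return inner
    return decorator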

Basically, we could add an option to django-cache-memoize that does this stuff but we'll put in the docs "You have to use django-redis for this to work".

Or, for people who don't use django-redis but have a Redis connection available, they could pass in an instance of it. E.g.

@cache_memoize(100, lock=True)
def expensive_function(start, end):
    return random.randint(start, end)

Or

import redis

_redis = redis.Redis(**config)  # config: your Redis connection settings (host, port, etc.)


@cache_memoize(100, lock=True, redis=_redis)
def expensive_function(start, end):
    return random.randint(start, end)

What do you think?

peterbe avatar Aug 15 '18 12:08 peterbe

Oh. @lamby you were the one who RT'ed this on Twitter. Hi!

peterbe avatar Aug 15 '18 12:08 peterbe

Hello @peterbe. :) However, as mentioned above, this does not do the Right Thing. Specifically, it either serialises all access to the cache (icky, slow, etc...), or, if only the underlying calls to expensive_function are serialised, the naive solution results in the function being called each time by every client who had to queue, instead of the waiting clients being served the result generated by the first client who had the cache miss.

lamby avatar Aug 15 '18 12:08 lamby

Good point. A "solution" is to use django-cache-memoize as always and combine it with something like that @lock_decorator from the blog post. Right?

@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    return random.randint(start, end)

Then, if two concurrent requests come in, the one that is a microsecond later should get the cached result. In theory.

Perhaps, if my logic is sound, the resolution is to mention this pattern in the README.

peterbe avatar Aug 15 '18 13:08 peterbe

Doesn't that still serialise all access to the cache as the @lock_decorator() is outside the memoize?

lamby avatar Aug 16 '18 08:08 lamby

I don't know if I understand. Perhaps we're talking about different things.

The example just above works very well in theory. Two highly concurrent requests (even on different web heads) are heading towards the @lock_decorator. One of them is a smidge ahead: it takes the lock and, with the lock held, proceeds to execute expensive_function and, when it's done, populates the cache. Meanwhile, the second request that was a smidge behind is stuck in the redis.Lock, but as soon as that's been released it continues into the @cache_memoize, where it can read from the cache and thus not execute expensive_function.

Isn't that ideal? Highly concurrent callers and only one global execution of the expensive function.

peterbe avatar Aug 16 '18 10:08 peterbe

> Highly concurrent callers

Not quite, as their access to the cache is serialised via the lock; they should be able to read the cache in parallel. We just need a variation on https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem etc.

lamby avatar Aug 16 '18 13:08 lamby

Can you demonstrate that problem?

If I have...:

@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    print("DOING FANCY WORK")
    return random.randint(start, end)

And run two threads/processes against this (with a timer), I'd expect the output to be:

Thread 1: Took 2.1 seconds
Thread 2: Took 2.11 seconds

and on stdout I would only get one:

DOING FANCY WORK

No?
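
Concretely, a minimal two-thread harness for that experiment might look like this (assuming the lock_decorator sketched above and a shared Redis-backed cache):

import random
import threading
import time

from cache_memoize import cache_memoize
# lock_decorator as sketched earlier in this thread (or from the blog post)


@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    print("DOING FANCY WORK")
    time.sleep(2)  # simulate the slow part
    return random.randint(start, end)


def timed_call(name):
    t0 = time.time()
    expensive_function(1, 100)
    print(f"{name}: Took {time.time() - t0:.2f} seconds")


threads = [threading.Thread(target=timed_call, args=(f"Thread {i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()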

peterbe avatar Aug 16 '18 14:08 peterbe

I'm not sure what your testcase is trying to show, sorry. All access to the cache is now serialised, instead of being trivially parallelisable. I.e. only one request can /check/ the cache at a time, whilst "before" this was perfectly safe and legitimate.

lamby avatar Aug 16 '18 15:08 lamby

Locking should be initiated only if the cache returns nothing, just before the calculation. This way, most of the time cache access is not serialized. The best way would be to combine this with refreshing the cached data just before the cache entry has expired.
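
One way to sketch the "refresh just before expiry" part (storing a timestamp alongside the value in a regular Django cache, so that a single caller refreshes early while everyone else keeps serving the old value):

import time

from django.core.cache import cache

HARD_TTL = 100   # how long the backend keeps the value
SOFT_TTL = 90    # after this many seconds, the next caller refreshes it early


def memoize_with_early_refresh(function):
    def inner(*args, **kwargs):
        key = f"{function.__name__}:{args!r}:{sorted(kwargs.items())!r}"
        entry = cache.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.time() - stored_at < SOFT_TTL:
                return value  # fresh: plain cache read, no lock, no recomputation
            # Getting stale: let exactly one caller refresh; everyone else keeps the old value.
            if not cache.add(key + ":refreshing", True, 30):
                return value
        value = function(*args, **kwargs)
        cache.set(key, (value, time.time()), HARD_TTL)
        cache.delete(key + ":refreshing")
        return value
    return inner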

atocyo avatar Apr 03 '24 18:04 atocyo