
Locking in multiple concurrent processes can cause hitting redis repeatedly

nickderobertis opened this issue 10 months ago

I have been observing that occasionally some of my cached + locked functions get stuck calling set_lock repeatedly. This leads to millions of Redis calls in a ~30-minute span in our application.


As I started digging into this, I found other unexpected behavior with locking and opened #332. Based on that investigation, we currently need ttl=None on the regular cache and a positive ttl on the lock to actually achieve proper locking.

When I write a script with that setup, multiprocessing, and a function with a random execution time, I can reproduce the issue consistently.

import asyncio
import multiprocessing
import random
from cashews import Command, cache
from cashews.backends.interface import Backend
import redis

r = redis.Redis(
    host="localhost",
    port=6379,
    db=0,
)
r.flushall()

_counter = 0


# Middleware that counts how many SET_LOCK commands are issued against the backend.
async def _logging_middleware(call, cmd: Command, backend: Backend, *args, **kwargs):
    global _counter
    if cmd == Command.SET_LOCK:
        _counter += 1
    return await call(*args, **kwargs)


_middlewares = (_logging_middleware,)


cache.setup(
    "redis://localhost:6379/0",
    client_side=True,
    retry_on_timeout=True,
    middlewares=_middlewares,
)


# Per #332: ttl=None on the outer cache plus a positive ttl on the lock.
@cache(ttl=None)
@cache.locked(ttl=5)
async def test():
    print("running function")
    await asyncio.sleep(random.random() * 0.1)
    return 1


async def _main():
    await test()
    print("set lock called", _counter, "times")


def main():
    asyncio.run(_main())


# Fork two processes that call the cached + locked function at the same time.
multiprocessing.set_start_method("fork")
for _ in range(2):
    multiprocessing.Process(target=main).start()

Outputs:

running function
running function
set lock called 1 times
set lock called 185 times

I think the issue is related to this while loop. If self.set_lock returns False, it gets called again almost immediately. The Redis implementation passes nx=True, and from my testing that returns None when the key is already set. So if a function enters this loop while the key is already held, it will hit the backend over and over until the key is removed.
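For reference, a minimal redis-py illustration of that nx=True behavior (the key name and values here are made up for the example):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# SET with nx=True only succeeds when the key does not already exist:
# it returns True on success and None while another client holds the key.
print(r.set("lock:test", "token-a", nx=True, ex=5))  # True
print(r.set("lock:test", "token-b", nx=True, ex=5))  # None, key already set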

nickderobertis avatar Feb 22 '25 22:02 nickderobertis

As I dig into the implementation more, I think things are working as intended; we are just using a very expensive strategy for locking. The logic appears designed to retry setting the lock as fast as possible until it is acquired, which leads to many calls against the backend in a short span of time and also slows the application down, because the waiting tasks keep getting rescheduled on the event loop.
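Roughly, the acquisition path behaves like the loop below (a simplified sketch, not the library's actual code; set_lock is assumed to return a falsy value while the key is held):

async def acquire_spinning(backend, key: str, ttl: float) -> None:
    # Simplified sketch of the current behavior: on failure the coroutine
    # re-awaits set_lock immediately, so every waiter issues backend calls
    # as fast as the event loop can schedule them.
    while not await backend.set_lock(key, "token", expire=ttl):
        pass  # no pause between attempts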

It is even possible to recreate the issue without multiprocessing:

import asyncio
from contextvars import ContextVar
import random
from cashews import Command, cache
from cashews.backends.interface import Backend
import redis

r = redis.Redis(
    host="localhost",
    port=6379,
    db=0,
)
r.flushall()

_counter = ContextVar[int]("_counter", default=0)


# Middleware that counts SET_LOCK calls; the ContextVar gives a per-task count.
async def _logging_middleware(call, cmd: Command, backend: Backend, *args, **kwargs):
    if cmd == Command.SET_LOCK:
        _counter.set(_counter.get() + 1)
    return await call(*args, **kwargs)


_middlewares = (_logging_middleware,)


cache.setup(
    "redis://localhost:6379/0",
    client_side=True,
    retry_on_timeout=True,
    middlewares=_middlewares,
)


# All three tasks share the same lock key, so they contend for one lock.
@cache.locked(ttl=5, key="test")
async def test(val: int):
    await asyncio.sleep(random.random() * 1)
    print("set lock called in task", val, "for", _counter.get(), "times")
    return 1


async def _main():
    await asyncio.gather(*[test(i + 1) for i in range(3)])


def main():
    asyncio.run(_main())


main()

Outputs:

set lock called in task 2 for 1 times
set lock called in task 3 for 728 times
set lock called in task 1 for 2542 times

One simple fix is to add an (ideally configurable) wait time between lock checks. Even a 50 ms wait decreases the number of calls to the backend by about 100x in a quick test of that script on my machine (a sketch of the retry loop follows the output below):

set lock called in task 3 for 1 times
set lock called in task 2 for 7 times
set lock called in task 1 for 23 times
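As a sketch of what I have in mind (the names wait_for_lock and step are illustrative, not the library's API; set_lock is again assumed to return a falsy value while the key is held):

import asyncio

async def wait_for_lock(backend, key: str, ttl: float, step: float = 0.05) -> None:
    # Same loop as the spinning version, but sleep `step` seconds between
    # attempts so each waiter polls the backend ~20 times per second
    # instead of as fast as the event loop allows.
    while not await backend.set_lock(key, "token", expire=ttl):
        await asyncio.sleep(step)

With step=0.05 and ttl=5, a waiter makes at most about ttl / step = 100 set_lock calls before the lock expires, which lines up with the roughly 100x reduction above.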

Another, more efficient approach, at least in the Redis backend, would be to subscribe to key-deletion events so that waiters can wait for the event rather than repeatedly checking whether the lock key exists. The lock TTL would still cover the case where a network outage caused the event to be missed. I'm going to spend some time looking into this approach, because it should be even more efficient than adding a wait while also letting waiters resume sooner.
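For reference, a rough sketch of waiting on a deletion event with redis-py's async client (the helper names and channel handling are illustrative; the server must have keyspace notifications enabled via notify-keyspace-events, which cashews does not configure today):

import asyncio
import redis.asyncio as aioredis

async def wait_for_unlock(r: aioredis.Redis, lock_key: str, ttl: float, db: int = 0) -> None:
    # Requires keyspace notifications on the server, e.g.
    #   CONFIG SET notify-keyspace-events Kgx
    # (K = keyspace channel, g = generic commands such as DEL, x = expired events).
    channel = f"__keyspace@{db}__:{lock_key}"
    async with r.pubsub() as pubsub:
        await pubsub.subscribe(channel)
        try:
            # Never wait longer than the lock TTL so that a missed notification
            # (e.g. a dropped connection) cannot leave a waiter hanging forever.
            await asyncio.wait_for(_wait_for_release(pubsub), timeout=ttl)
        except asyncio.TimeoutError:
            pass

async def _wait_for_release(pubsub) -> None:
    # Keyspace messages carry the event name as their payload.
    async for message in pubsub.listen():
        if message["type"] == "message" and message["data"] in (b"del", b"expired"):
            return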

nickderobertis avatar Feb 23 '25 20:02 nickderobertis

OK, I think the key-deletion-event approach is too complex:

  1. It requires configuring the Redis server, which this lib has not attempted to do so far. We might be able to do that for OSS Redis, but we could run into issues with managed services like ElastiCache, which seem to want you to configure it a different way.
  2. Unlike regular Redis pub/sub, keyspace notifications specifically require subscribing to each node within a cluster, which would likely mean breaking changes to the lib's connection semantics.

Perhaps a regular pub/sub channel could be used for this; I see the client-side version of the Redis backend already uses a pub/sub channel, and we might even be able to send these notifications on that same channel. But working with the internals is a little tricky, and I'm struggling to call lock on the backend directly without breaking something else (e.g. AttributeError: 'BcastClientSide' object has no attribute '_client'). I'm running out of time to work on this, so I think I'll just put up a PR that adds a check interval.

nickderobertis avatar Feb 23 '25 21:02 nickderobertis

Thanks for the report and for such a thorough investigation and contribution, I appreciate it. The PR was merged.

I need to think about a better mechanism. I had some thoughts, but now I have a newborn baby and everything has "faded away".

Let me keep this issue open so I don't forget about it.

Krukov avatar Mar 09 '25 19:03 Krukov

Great, thank you, and congratulations on the baby @Krukov! I understand you'll have your hands full for a while, but when you get a bit of downtime I would love to hear your thoughts. I might have some time in the coming weeks to hack on a better approach.

nickderobertis avatar Mar 10 '25 13:03 nickderobertis