DistributedLock icon indicating copy to clipboard operation
DistributedLock copied to clipboard

How to distinguish Redis unavailability from lock contention in TryAcquireAsync?

Open unzeitip opened this issue 8 months ago • 4 comments

Hi, first of all, thank you for this great library — it’s been very helpful for us! 🙏

I have a question regarding how RedisDistributedLock.TryAcquireAsync behaves when Redis is unavailable (e.g., Redis crash, network failure, OOM).

From what I can see during my testing, when Redis is down, TryAcquireAsync seems to just return null, without throwing any exception. This makes it hard to distinguish between:

  • the lock being already held (normal contention), and
  • Redis being unreachable (infrastructure issue).

Is there currently a way to tell these two situations apart when using TryAcquireAsync?

unzeitip avatar Apr 27 '25 18:04 unzeitip

@unzeitip if Redis is down then intended behavior of TryAcquireAsync should be to throw an exception (that said, we can only do this if the underlying StackExchange client propagates this information to us).

If that's not the behavior and you see a path to fixing, I'd happily accept a PR!

madelson avatar May 14 '25 00:05 madelson

You can call Ping() on IDatabase (database.Ping()) wrapped in try/catch to see if it's not connected. I also use TryAcquireAsync just to detect a lock then follow it with AcquireAsync which honors the timeout. That way we can still have a sensible timeout and not crush the shared resource. At least this works with Redis.

Changing this behavior now could break a lot of code and cause really bad things. Probably best to document it in the C# doc method summary in the code and suggest calling database.Ping() to check.

But yes, I can confirm the behavior mentioned in the ticket, even if you pass a timeout to TryAcquireAsync (with Redis). It just returns a null handle.

I converted my code into pseudocode below, got rid of a lot of things for readability. That means it may not work out of the box

            var canConnect = true;
            try
            {
                redis.Ping();
            }
            catch(Exception ex)
            {
                logger.LogError(ex, "Cannot connect to Redis!");
                canConnect = false;
            }

            // For logging
            if (canConnect)
            {
                await using (var handle = await _semaphore.TryAcquireAsync())
                {
                    if (handle == null)
                    {
                        logger.LogInformation("Resource unavailable. Will wait.");
                    }
                }
            }

            try
            {
                // Now run
                return await _semaphore.AcquireAsync(timeout: TimeSpan.FromSeconds(100), cancellationToken);
            }
            catch (TimeoutException ex)
            {
                throw new MyCustomException("Handle this exception in caller esp if it has to still do something important!", ex);
            }

Although the code above handles intermittent outages of Redis, another thing I do in Program.cs is completely prevent startup if Redis is not available so we know something is wrong (meaning that instance will fail to launch in a cluster).


        var database = ConnectionMultiplexer.Connect(configuration["Redisconstring"]).GetDatabase();
        // We want the app to fail startup immediately if it cannot connect, otherwise a critical action will be blocked
        database.Ping(); // Will fail startup if not connected
        return database;

netdragonboberb avatar Jun 29 '25 04:06 netdragonboberb

@netdragonboberb are you saying that with the current implementation if Redis is down then TryAcquire with any timeout returns null without waiting for the full timeout to expire?

madelson avatar Jul 02 '25 01:07 madelson

Had a chance to investigate this; it's pretty easy to replicate in the test suite:

        var @lock = new RedisDistributedLock("abc", TestingRedis2x1DatabaseProvider.DeadDatabase);
        await using var handle = await @lock.TryAcquireAsync();
        Assert.IsNull(handle); // works because it times out

As far as I can tell, we aren't swallowing any errors. Rather, the issue is that the underlying connection hasn't given up on the database yet and is still trying to connect and the implicit timeout of 0 in TryAcquire times out before that does.

Ideally in a single-DB scenario I don't think TryAcquireAsync() should time out before the connection is determined to have succeeded or failed; that could lead to spurious failed acquires that would have been successful if the timeout were longer.

That said, I'm not sure how to fix this other than doing a PingAsync() every time which feels egregious (maybe we can eschew the ping if we check IsConnected?).

madelson avatar Oct 26 '25 14:10 madelson