RedLock.net

Random NoQuorum Errors when using a redis cluster with multiple nodes.

Open mikehank96 opened this issue 3 years ago • 9 comments

We have a cluster with 2 nodes and have regularly been getting random NoQuorum errors when creating many locks against it. I've experimented by creating 500 locks and usually get between 1 and 20 NoQuorum errors. When doing the same against a single-node redis instance I don't get the errors. The readme states that using replicated instances is not the suggested way to use RedLock, but also says it's supported, so I'm not sure whether these errors are expected or not. Currently using RedLock.net 2.3.1.

mikehank96 avatar Sep 30 '21 20:09 mikehank96

Additional info: Factory creation:

```csharp
var lockFactory = RedLockFactory.Create(new List<RedLockEndPoint>
{
    new RedLockEndPoint
    {
        EndPoint = new DnsEndPoint(redisConfigurationEndpoint, port),
        Password = secret,
        Ssl = true,
        RedisKeyFormat = "MyKey_{0}",
    }
});
```

Lock creation:

```csharp
await _redLockFactory.CreateLockAsync(resource, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(1))
```

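For reference, a minimal sketch of how a lock returned by `CreateLockAsync` is typically consumed (assuming the same `_redLockFactory` and `resource` as above):

```csharp
// Arguments are: resource, expiry, waitTime, retryTime.
using (var redLock = await _redLockFactory.CreateLockAsync(
    resource, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(1)))
{
    if (redLock.IsAcquired)
    {
        // The lock is held until the block is exited (or the expiry elapses).
    }
    else
    {
        // Not acquired; redLock.Status reports why (e.g. NoQuorum).
    }
}
```
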
mikehank96 avatar Sep 30 '21 21:09 mikehank96

Are you two servers a master/replica configuration, a Redis Cluster (with keys sharded across both), or two independent servers?

And in your RedLockFactory configuration, are you only connecting to one of them?

samcook avatar Sep 30 '21 23:09 samcook

Redis Cluster. We're using Terraform and it's all created in one aws_elasticache_replication_group resource. As for the configuration, I'm using the configuration endpoint, which I figured handles all connections.

mikehank96 avatar Sep 30 '21 23:09 mikehank96

So I tried boosting the number of lock requests in my tests from 500 to 700 and now I get the NoQuorum errors from the single-node redis instance as well, so it isn't a number-of-nodes issue.

mikehank96 avatar Oct 01 '21 14:10 mikehank96

After some more experimentation I realized that if I increase the wait time for the lock creation, the errors disappear. Is it possible for NoQuorum statuses to be returned for timeouts?

mikehank96 avatar Oct 04 '21 20:10 mikehank96

Yes, that is possible. If an attempt to acquire a lock in an instance doesn't complete within the timeout it is treated as a failure, and if there aren't enough successfully acquired instances to meet the quorum then it will fail with NoQuorum.

The quorum required is floor(n/2 + 1), where n is the number of independent instances you have.

| Instances | Quorum |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
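
A quick sketch of that calculation (integer division gives the floor):

```csharp
// Quorum for n independent instances: floor(n / 2 + 1), i.e. a strict majority.
static int Quorum(int instanceCount) => instanceCount / 2 + 1;

// Quorum(1) == 1, Quorum(2) == 2, Quorum(3) == 2, Quorum(4) == 3, Quorum(5) == 3
```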

samcook avatar Oct 05 '21 11:10 samcook

Does Redlock do some kind of introspection when it is passed the endpoint? This is being used with an AWS ElastiCache "cluster mode enabled" cluster - it has 3 shards and 6 nodes. The AWS docs state that you should just do all your writes through their "configuration endpoint":

> Redis (cluster mode enabled) clusters, use the cluster's Configuration Endpoint for all operations that support cluster mode enabled commands. You must use a client that supports Redis Cluster (Redis 3.2). You can still read from individual node endpoints (In the API/CLI these are referred to as Read Endpoints).

(https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Endpoints.html)

Does Redlock (or the underlying StackExchange.Redis driver) somehow look up the cluster information from that single endpoint, or do we need to configure things differently in order to use a cluster like this in AWS? Somehow it must know that it doesn't just have a single instance (otherwise why would it throw errors about NoQuorum)?

It seems that the recommended way to do it would be to use multiple (3) standalone redis nodes, pass each of them in as endpoints to RedLock, and then RedLock will maintain a quorum across those standalone nodes on its own?

ryangardner avatar Oct 06 '21 18:10 ryangardner

I haven't used Redis on AWS myself, so I'm not too sure whether they do things any differently to a standard Redis Cluster.

> Does Redlock (or the underlying StackExchange.Redis driver) somehow look up the cluster information from that single endpoint, or do we need to configure things differently in order to use a cluster like this in AWS? Somehow it must know that it doesn't just have a single instance (otherwise why would it throw errors about NoQuorum)?

RedLock.net doesn't do anything specific to look up cluster information - if anything happens there it would be within StackExchange.Redis.

If you are only providing one RedLockEndPoint (or one ConnectionMultiplexer, if you are using them directly) when you create your RedLockFactory then RedLock.net will treat your cluster as a single instance.

It is possible to get NoQuorum responses even with a single instance if that instance doesn't acquire a lock within the timeout period (in this situation the quorum is 1 and locks were acquired in 0 instances).

> It seems that the recommended way to do it would be to use multiple (3) standalone redis nodes, pass each of them in as endpoints to RedLock, and then RedLock will maintain a quorum across those standalone nodes on its own?

Yes, that would be the suggested way to do it if you want more resilience than is offered by a single standalone instance.
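
For example, something along these lines, with each endpoint pointing at an independent, non-replicated Redis server (the hostnames here are placeholders, and the other properties mirror the configuration shown earlier in the thread):

```csharp
// Three independent instances; RedLock.net acquires against all of them and
// requires a quorum (2 of 3) to consider the lock held.
var lockFactory = RedLockFactory.Create(new List<RedLockEndPoint>
{
    new RedLockEndPoint { EndPoint = new DnsEndPoint("redis-1.example.com", 6379), Password = secret, Ssl = true },
    new RedLockEndPoint { EndPoint = new DnsEndPoint("redis-2.example.com", 6379), Password = secret, Ssl = true },
    new RedLockEndPoint { EndPoint = new DnsEndPoint("redis-3.example.com", 6379), Password = secret, Ssl = true },
});
```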

samcook avatar Oct 06 '21 21:10 samcook

What's the difference between the retries configured by RedLockRetryConfiguration and the retries configured by CreateLockAsync? When setting RedLockRetryConfiguration instead of the CreateLockAsync parameters, the NoQuorum errors seemed to stop. Looking at the source code, it looks like they do almost the same thing, except one loops outside of AcquireAsync and one loops inside it.
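
For context, a sketch of the two places retries can be configured; the exact RedLockRetryConfiguration constructor and factory overload shown here (retryCount, retryDelayMs) are an assumption worth checking against the version in use:

```csharp
// Factory-level retry configuration (per the discussion above, this drives the loop
// inside the acquire call).
var lockFactory = RedLockFactory.Create(
    endPoints,
    new RedLockRetryConfiguration(retryCount: 3, retryDelayMs: 400));

// Per-lock retry configuration: keep trying for up to waitTime, re-attempting every
// retryTime (the loop outside the acquire call).
using (var redLock = await lockFactory.CreateLockAsync(
    resource,
    TimeSpan.FromSeconds(15),  // expiry
    TimeSpan.FromSeconds(15),  // waitTime
    TimeSpan.FromSeconds(1)))  // retryTime
{
    if (redLock.IsAcquired)
    {
        // ...
    }
}
```

Here `endPoints` and `resource` stand in for the endpoint list and key name from the earlier snippets.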

mikehank96 avatar Oct 08 '21 15:10 mikehank96