StackExchange.Redis Required Redis ACL for Cluster

We are running our production Redis cluster in a 3 master + 8 slave configuration. When we performed a manual master failover, we observed many errors like the following until we rebooted all of the clients connecting to the cluster.

ERR Error running script (call to xxxxxx): @user_script:2: @user_script: 2: -READONLY You can't write against a read only replica.

Originally we thought this was due to being on an older version of this library, but after updating to the latest we could still reproduce the above error. We now know that this was actually caused by a permission issue, but it isn't 100% clear to me what permissions this library actually requires after reading deeper into your ACL documentation https://redis.io/topics/acl.

I am wondering what ACL list is recommended or required for a cluster configuration like ours.

This is our current ACL:

+@all -@admin -@dangerous

Will changing it to the following fix these errors?

+@all -@dangerous +psync +replconfig +info

Jan 03 '22 19:01 AlCapown

This doesn't seem to be ACL related to me. If it was working, it should continue working. This seems more likely to be simply fallout from a cluster topology change. Now: as far as I know, redis doesn't automatically broadcast when the topology changes, so clients can get confused and end up with a different view of reality, thus sending commands to the wrong servers. For slot changes, we can usually recover from this relatively gracefully via MOVED, but for read-only/read-write changes, it isn't perfect. When the library issues this kind of command, it tries to broadcast on an expected channel, so that clients reconnect. Or there is an explicit reconnect API. I wonder, though, if the real trick here would be to try harder to catch this category of error, and interpret it as a "hey, something changed; check the world" event.
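For illustration only, here is a minimal application-side sketch of the "something changed; check the world" idea. It is not library code and not what the library does internally. The `ReadOnlyRetry` class and both method names are hypothetical, and the matched strings are simply the two error messages quoted in this thread; matching on message text is an assumption that may not hold across versions.

```csharp
using System;
using System.Threading.Tasks;

public static class ReadOnlyRetry
{
    // Walks the exception chain looking for the error texts quoted in this thread.
    // Matching on message text is an assumption, not a documented contract.
    public static bool LooksLikeReplicaWrite(Exception ex)
    {
        for (var e = ex; e != null; e = e.InnerException)
        {
            if (e.Message.Contains("READONLY", StringComparison.Ordinal) ||
                e.Message.Contains("Command cannot be issued to a replica", StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }

    // Runs the write once; if it fails in a way that looks like a write against
    // a replica, asks the caller to refresh its view of the topology and retries
    // exactly once. Any other exception, or a second failure, propagates.
    public static async Task WriteWithRefreshAsync(Func<Task> write, Func<Task> refreshTopology)
    {
        try
        {
            await write();
        }
        catch (Exception ex) when (LooksLikeReplicaWrite(ex))
        {
            await refreshTopology();
            await write();
        }
    }
}
```

With StackExchange.Redis, `write` could be `() => redis.StringSetAsync("TestKey", value)` and `refreshTopology` could be `() => connection.ConfigureAsync()`, which asks the multiplexer to re-examine its endpoints. Whether a single refresh is enough after a manual failover is exactly what is in question here, so treat this as a diagnostic aid rather than a fix.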

Jan 03 '22 19:01 mgravell

I'm pretty confident it is ACL related though. We did test with just +@all and the issue went away. But for obvious security reasons, we don't want to give our clients full access.

Jan 03 '22 20:01 AlCapown

@AlCapown One consideration: when you did the failover, did the client library know about it? If we detected a down state in the heartbeat, then we reconfigured to the right servers (the most we can do, due to the lack of notice when something drops, as Marc said). If this happened you'd see the same "it works now" symptoms regardless of the ACL change. There are no ACL related shenanigans I'm aware of here, and the error doesn't seem to indicate it's ACL either: it's that we're targeting and writing to the wrong server.

Can you correlate which server is getting hit (and errors) with whether it was a replica at that time or not?
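One way to gather that correlation from the client side is to log how the multiplexer currently sees each endpoint. The sketch below is an assumption about how you might do it, not an official diagnostic; the default connection string is a placeholder. It uses `GetEndPoints`, `GetServer`, `IServer.IsConnected` and `IServer.IsReplica`.

```csharp
using System;
using StackExchange.Redis;

// Placeholder default; pass your own connection string as the first argument.
string connectionString = args.Length > 0 ? args[0] : "localhost:6379";
using var connection = ConnectionMultiplexer.Connect(connectionString);

// Print the client's current view of every endpoint it knows about.
// This is what the client believes, which may be stale after a failover.
foreach (var endpoint in connection.GetEndPoints())
{
    IServer server = connection.GetServer(endpoint);
    Console.WriteLine(
        $"{DateTime.UtcNow:O} {endpoint} connected={server.IsConnected} replica={server.IsReplica}");
}
```

Running this right after a failover and comparing it with what each node reports about itself (for example via the `ROLE` command) should show whether the client still treats a demoted node as writable.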

Jan 03 '22 22:01 NickCraver

I should clarify that we never actually took the master node offline; we just promoted one of the slaves to be the new master. We had to do this because, once we had migrated our clients over to the cluster, we realized our 3 master nodes were not striped across our 3 separate fault domains. Our IT team was able to see in one of their network management tools that the existing connections to the original master stayed active after we switched. In our tests where we pulled the virtual power plug out from one of the masters, we did not see this behavior and requests did successfully transfer to the new master node.

Is it possible that this heartbeat only sees connection timeouts/failures? I agree that these errors don't seem ACL related, which is why we first tried updating the client Redis lib to the latest 2.2.88 release, since we saw there was a potential fix for this in 2.2.50 https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes#2250.

Jan 04 '22 15:01 AlCapown

In this case, the master node transitioned from writable to not writable, without the client knowing about it so yep...the behavior you're seeing makes sense. The question is how to best proceed here. If the promotion is done via the library, we handle/broadcast this but that's not an option for all teams - we need to see if there's an additional event we're somehow not listening for in cluster. Just so we can see options - what version of Redis server are you using?

Jan 04 '22 15:01 NickCraver

And as I intimated earlier: if the library can help here, we can tweak it. I don't know if we explicitly react to -READONLY and interpret it as a config change, but we absolutely could do. And if we already do, but don't handle that inside scripting replies: we can tweak that.

Jan 04 '22 19:01 mgravell

@mgravell ya know what? This may be related to the re-subscription issues after a reconnect I'm hitting in the PR. If it's not happening reliably...we wouldn't be picking up cluster change events either... Let's figure that out and report back!

Jan 05 '22 02:01 NickCraver

We are on the latest stable release of Redis server, 6.2.6.

Jan 05 '22 14:01 AlCapown

Here is how you repro.

Using this simple console app and the following ACL configured in the .conf file (I was on .NET 6 and StackExchange.Redis version 2.2.88)

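using System;
using System.Threading.Tasks;
using StackExchange.Redis;

class Program
{
    static async Task Main(string[] args)
    {
        // The connection string is passed as the first command-line argument.
        string connectionString = args[0];
        var connection = ConnectionMultiplexer.Connect(connectionString);
        var redis = connection.GetDatabase();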

        for(int i = 0; true; i++)
        {
            Console.WriteLine($"Begin loop {i}");
            
            try
            {
                await redis.StringSetAsync("TestKey", i.ToString());
                Console.WriteLine($"Successfully set TestKey with value {i}");
            }
            catch(Exception ex)
            {
                Console.WriteLine(ex);
            }

            await Task.Delay(10_000);
        }
    }
}

user restricteduser +@all -@admin -@dangerous +info ~* on >restrictedpass
user adminuser +@all ~* on >adminpass

First use a connection string with user=restricteduser and password=restrictedpass.
Then do a cluster failover on each of the original masters until you see the error I mentioned. (Log in to a slave of a master and run cluster failover force.)

If you do the test a second time but with user=adminuser and password=adminpass you will not see the error.

Jan 05 '22 15:01 AlCapown

The error will be a little different in my example but it's still trying to do a SET command against a replica.

StackExchange.Redis.RedisConnectionException: InternalFailure on SET TestKey
 ---> StackExchange.Redis.RedisCommandException: Command cannot be issued to a replica: SET TestKey
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1361

Jan 05 '22 15:01 AlCapown

You know what, I take back what I said about this being ACL related... Ran my tests again with full access and I can still reproduce the errors. The behavior is strange to me because sometimes it won't fail and other times it will fail one or more times before fixing itself.

Jan 05 '22 22:01 AlCapown

Yeah, I'd expect this has nothing to do with ACLs and everything to do with "topology changed and we don't know about it". That being said, one of the topology change notifications we observe may not be working; Marc and I are investigating that as part of a general subscriptions issue. We'll ping when that's in place to see if it helps, if you're able to test a MyGet build...

Jan 05 '22 22:01 NickCraver

I just ran into the same issue. AWS ElastiCache with 3 nodes had a failover of the primary from node 1 to node 3. It seems that our services continue to attempt to write to node 1. Same error message as in the original post of this thread.

Jan 19 '22 15:01 tylerohlsen

Status update: working on improving this in #1947, which will land in the 2.5.x release. Subscriptions had some fun issues overall and got a ton of love.

Feb 06 '22 03:02 NickCraver

Do you have an estimate on the timeframe of when 2.5.x might be released?

Feb 09 '22 15:02 AlCapown

@NickCraver I pulled down the latest prerelease version, 2.5.24-prerelease, and I can still reproduce the same error.

StackExchange.Redis.RedisConnectionException: InternalFailure on SET user:TestKey
 ---> StackExchange.Redis.RedisCommandException: Command cannot be issued to a replica: SET user:TestKey
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1258

Feb 11 '22 18:02 AlCapown

@AlCapown Can you give me the full stack there please? Not 100% sure where that command is coming from and a full stack would help.

Feb 13 '22 15:02 NickCraver

@NickCraver Not sure if the full details are going to reveal much for you. It's still that example program I listed above that writes a string in a loop. It's happening when I call StringSetAsync.

StackExchange.Redis.RedisConnectionException: InternalFailure on SET user:TestKey
 ---> StackExchange.Redis.RedisCommandException: Command cannot be issued to a replica: SET user:TestKey
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1258
   --- End of inner exception stack trace ---
   at RedisCommandLoop.Program.Main(String[] args) in C:\Users\kyle6\Desktop\RedisCommandLoop\RedisCommandLoop\RedisCommandLoop\Program.cs:line 28

Feb 14 '22 23:02 AlCapown

I added documentation for this to https://stackexchange.github.io/StackExchange.Redis/Configuration#redis-server-permissions as a follow-up here. It outlines the combinations needed and why we're issuing each. If you don't want to allow a command that's all good: we'll reduce functionality if we're able, but the command should also be disabled in the CommandMap so that we don't issue it :)
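As a rough sketch of what disabling commands in the CommandMap can look like: the endpoint and credentials below are placeholders taken from the repro earlier in the thread, and the excluded command (`CONFIG`) is only an example. Which commands you should exclude depends on your ACL and on the permissions documentation linked above; excluding commands the library relies on for topology discovery will reduce what it can do.

```csharp
using System.Collections.Generic;
using StackExchange.Redis;

// Placeholder endpoint and credentials; substitute your own.
var options = ConfigurationOptions.Parse("localhost:6379");
options.User = "restricteduser";
options.Password = "restrictedpass";

// Tell the client not to issue the listed commands at all.
// "CONFIG" is an illustrative choice, not a recommendation.
options.CommandMap = CommandMap.Create(
    new HashSet<string> { "CONFIG" },
    available: false);

var connection = await ConnectionMultiplexer.ConnectAsync(options);
```

The point of the design is that the ACL and the CommandMap agree: the server refuses the command, and the client never tries to send it.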

Sep 04 '22 17:09 NickCraver
