
Not able to read the updated Master when connected Master/Slave through sentinel

Open UdCa-Codes opened this issue 4 years ago • 9 comments

Bug Report

I'm using a Master/Slave connection through Sentinel, with the configuration below to connect to Redis Sentinel.

Current Behavior

We host our app in Kubernetes. For the first few master-node failures, Lettuce responds by updating the sentinel configuration; after that it returns an error saying "Cannot find the master node".

@Bean
public RedisConnectionFactory redisConnectionFactory() {
    LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
            .readFrom(ReadFrom.MASTER)
            .build();
    // RedisSentinelConfiguration is configured fluently; it has no build() method.
    RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
            .master("mysentinelmaster")
            .sentinel("xxx", 6783);
    return new LettuceConnectionFactory(sentinelConfig, clientConfig);
}

@Bean
public RedisTemplate<?, ?> redisTemplate() {
    RedisTemplate<?, ?> template = new RedisTemplate<>();
    template.setConnectionFactory(redisConnectionFactory());
    return template;
}

Here are the logs:

org.springframework.data.redis.RedisSystemException: Redis exception; nested exception is io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='xxx', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='xxx', port=6379], role=SLAVE]]
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:74)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:268)
	at org.springframework.data.redis.connection.lettuce.LettuceSetCommands.convertLettuceAccessException(LettuceSetCommands.java:520)
	at org.springframework.data.redis.connection.lettuce.LettuceSetCommands.sRem(LettuceSetCommands.java:394)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.sRem(DefaultedRedisConnection.java:649)
	at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:61)
	at com.sun.proxy.$Proxy135.sRem(Unknown Source)
	at org.springframework.data.redis.core.RedisKeyValueAdapter$MappingExpirationListener.lambda$onMessage$1(RedisKeyValueAdapter.java:787)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:224)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:184)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:171)
	at org.springframework.data.redis.core.RedisKeyValueAdapter$MappingExpirationListener.onMessage(RedisKeyValueAdapter.java:785)
	at org.springframework.data.redis.listener.RedisMessageListenerContainer.executeListener(RedisMessageListenerContainer.java:250)
	at org.springframework.data.redis.listener.RedisMessageListenerContainer.processMessage(RedisMessageListenerContainer.java:240)
	at org.springframework.data.redis.listener.RedisMessageListenerContainer.lambda$dispatchMessage$0(RedisMessageListenerContainer.java:986)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='xxx', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='xxx', port=6379], role=SLAVE]]
	at io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster(MasterSlaveConnectionProvider.java:304)
	at io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getConnectionAsync(MasterSlaveConnectionProvider.java:153)
	at io.lettuce.core.masterslave.MasterSlaveChannelWriter.write(MasterSlaveChannelWriter.java:66)
	at io.lettuce.core.RedisChannelHandler.dispatch(RedisChannelHandler.java:187)
	at io.lettuce.core.StatefulRedisConnectionImpl.dispatch(StatefulRedisConnectionImpl.java:152)
	at io.lettuce.core.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:467)
	at io.lettuce.core.AbstractRedisAsyncCommands.srem(AbstractRedisAsyncCommands.java:1367)
	at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:57)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy131.srem(Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceSetCommands.sRem(LettuceSetCommands.java:392)
	... 15 common frames omitted

Expected behavior/code

Lettuce should keep refreshing the topology even after receiving updated master details from Sentinel, so that it recovers when the current details turn out to be stale or unusable.

Environment

  • Lettuce version(s): 5.1.0.RELEASE
  • Redis version: 4.0.9

UdCa-Codes avatar May 22 '20 01:05 UdCa-Codes

Lettuce listens continuously to Sentinel Pub/Sub channels for topology updates. This approach is the most elaborate one, as Sentinels actively publish changes in master and replica configuration. Can you provide a simple, reproducible test case or the logs from the time of the failover until the command failure?
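
For reference, this is roughly what a plain-Lettuce Sentinel connection looks like — a minimal sketch, assuming Lettuce 6.x's `MasterReplica` API, a hypothetical Sentinel at `localhost:26379`, and a hypothetical master name `mymaster`. Connecting through a Sentinel `RedisURI` is what enables the Pub/Sub-driven topology refresh described above:

```java
import java.time.Duration;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class SentinelConnect {
    public static void main(String[] args) {
        // Hypothetical Sentinel endpoint and master name; adjust to your setup.
        RedisURI sentinelUri = RedisURI.builder()
                .withSentinel("localhost", 26379)
                .withSentinelMasterId("mymaster")
                .withTimeout(Duration.ofSeconds(10))
                .build();

        RedisClient client = RedisClient.create();
        // Connecting via a Sentinel URI makes Lettuce subscribe to the Sentinel
        // Pub/Sub channels and refresh its known topology when a failover is announced.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, sentinelUri);
        connection.setReadFrom(ReadFrom.MASTER);

        System.out.println(connection.sync().ping());

        connection.close();
        client.shutdown();
    }
}
```

With a setup like this, no manual topology-refresh trigger should be needed; the driver reacts to the events Sentinel publishes.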

mp911de avatar May 22 '20 07:05 mp911de

@mp911de I face this issue intermittently. I tried it a few times and here are the logs that I see. Steps:

  1. I failed the Redis master and it recovered fine.

  2. I brought the master up and then brought it down again.

I see the errors below after repeating the above steps a few times (logs attached: Redis Issue.pdf).

I'm actually unsure whether there is something I'm missing in the configuration or it is a bug. Or should I manually trigger the topology refresh or Pub/Sub channels from my configuration?

UdCa-Codes avatar Jun 01 '20 15:06 UdCa-Codes

Hello guys! We're facing the same issue described by @UdCa-Codes; our setup likewise consists of a master-slave arrangement with Sentinel.

This is the driver configuration we're using:

application.store.master-name=<master-name>
application.store.hosts=<sentinel-address>
application.store.port=<sentinel-port>
application.store.username=<username>
application.store.password=<password>
application.store.commandTimeout=15s

We're using version 6.0.1, but we've been facing this issue for a long time across other versions. It seems to happen quite randomly. From the Redis perspective, the master election occurs with no errors, and we're able to log in to Sentinel and discover the master normally through redis-cli, for example. From the application perspective, sometimes we need to reboot it to get it to discover the master through Sentinel.

In the application's log we see this kind of message:

Master is currently unknown: [RedisUpstreamReplicaNode [redisURI=redis://****@<previous-master-IP> role=REPLICA], RedisUpstreamReplicaNode [redisURI=redis://****@<previous-replica-IP>, role=REPLICA]]

Is there some kind of debugging we can do to get more information from the driver itself?

maiconbaumx avatar Dec 16 '20 18:12 maiconbaumx

I don't know if it's completely related, but we faced a bit of a similar issue some time ago where newly started applications would sometimes complain about not finding a master and staying in a broken state. The only way to 'fix' them was by sending a SENTINEL RESET to one of the Sentinel servers, so all clients would get info sent by the Sentinels. We couldn't quite figure out what was going wrong through the debug logging. After more digging, we did stumble upon a change some time ago where we reduced the Redis timeout from 10s to 1s. After reverting this change, the problem stopped occurring.
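
For anyone wanting to try the same mitigation in plain Lettuce: the timeout siwyd mentions corresponds to the `RedisURI` timeout. A sketch — the 10-second value is the one from this comment, and the Sentinel address and master name are hypothetical:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;

public class SentinelTimeout {
    public static void main(String[] args) {
        // Hypothetical Sentinel endpoint; the relevant part is withTimeout(...),
        // which bounds Sentinel lookups as well as regular commands.
        RedisURI uri = RedisURI.builder()
                .withSentinel("localhost", 26379)
                .withSentinelMasterId("mymaster")
                .withTimeout(Duration.ofSeconds(10)) // 1s proved too aggressive in siwyd's setup
                .build();

        System.out.println(uri.getTimeout()); // PT10S
    }
}
```

With Spring Data Redis, the equivalent knob would be `LettuceClientConfiguration.builder().commandTimeout(...)`.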

siwyd avatar Dec 17 '20 08:12 siwyd

I am seeing the same (or a similar) issue. I think my scenario was roughly the following (this was on my local dev setup, luckily):

  1. Redis Sentinel processes were killed (by me manually)

  2. One of the Redis nodes was firewalled (sudo iptables -A INPUT -p tcp --dport 6379 -j DROP)

  3. Our application was then connecting to the Redis Sentinel cluster, using Lettuce 5.3.0 and connect code similar to this (manually recreated; the actual code is spread over multiple places in our case and conditionalized to support both single-node and Sentinel setups):

            RedisURI.Builder redisUriBuilder = RedisURI.builder()
                    .withSentinelMasterId( sentinelMasterId )
                    .withPassword( password );
    
            redisUriBuilder.withSentinel( host1, port, password );
            redisUriBuilder.withSentinel( host2, port, password );
            redisUriBuilder.withSentinel( host3, port, password );
    
            RedisURI redisUri = redisUriBuilder.build();
    
            DefaultClientResources.Builder clientResourcesBuilder = DefaultClientResources.builder();
    
            if ( computationThreadPoolSize > 0 ) {
                clientResourcesBuilder.computationThreadPoolSize( computationThreadPoolSize );
            }
    
            if ( ioThreadPoolSize > 0 ) {
                clientResourcesBuilder.ioThreadPoolSize( ioThreadPoolSize );
            }
    
            RedisClient redisClient = RedisClient.create( clientResourcesBuilder.build() );
            redisClient.setDefaultTimeout( Duration.ofMillis( connectionTimeout ) );
    
            StatefulRedisMasterReplicaConnection<String, byte[]> redisConnection = MasterReplica.connect(
                    redisClient,
                    StringByteArrayCodec.INSTANCE,
                    redisUri
            );
    
            redisConnection.setTimeout( Duration.ofMillis( redisServer.getCommandTimeout() ) );
            redisConnection.setReadFrom( redisSentinelServer.readFrom() );
    

With this in place, Lettuce continuously fails to reconnect to the master node if it was down on the initial connection. Here is the stack trace (up until the first application-level line):

io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='192.168.97.13', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='192.168.97.11', port=6379], role=SLAVE]]
	at io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster(MasterSlaveConnectionProvider.java:309)
	at io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getConnectionAsync(MasterSlaveConnectionProvider.java:158)
	at io.lettuce.core.masterslave.MasterSlaveChannelWriter.write(MasterSlaveChannelWriter.java:66)
	at io.lettuce.core.RedisChannelHandler.dispatch(RedisChannelHandler.java:187)
	at io.lettuce.core.StatefulRedisConnectionImpl.dispatch(StatefulRedisConnectionImpl.java:169)
	at io.lettuce.core.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:472)
	at io.lettuce.core.AbstractRedisAsyncCommands.set(AbstractRedisAsyncCommands.java:1223)
	at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:57)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy47.set(Unknown Source)
	at fi.hibox.centre.module.job.rediscachepopulator.managers.RedisCrudManager.lambda$upsertEntity$0(RedisCrudManager.java:55)

The proper Redis master node is on 192.168.97.12 in this case, but Lettuce fails to ever realize it. Issuing a SENTINEL RESET * command manually seemed to help it, but... I'd much rather see it self-heal of course.
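
For completeness, the manual SENTINEL RESET workaround mentioned above can also be issued through Lettuce's Sentinel API rather than redis-cli. A sketch, assuming a hypothetical Sentinel at `localhost:26379`:

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.sentinel.api.StatefulRedisSentinelConnection;

public class SentinelReset {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create();
        // Hypothetical Sentinel endpoint; note this connects to the Sentinel
        // port, not a Redis data node.
        RedisURI sentinelUri = RedisURI.create("redis://localhost:26379");
        try (StatefulRedisSentinelConnection<String, String> sentinel =
                client.connectSentinel(sentinelUri)) {
            // SENTINEL RESET '*' makes this Sentinel discard its state for all
            // monitored masters and re-run discovery, re-publishing topology
            // events that stuck clients can pick up.
            Long resetCount = sentinel.sync().reset("*");
            System.out.println("Reset " + resetCount + " master(s)");
        } finally {
            client.shutdown();
        }
    }
}
```

This is a workaround, of course, not a fix for the underlying self-healing problem.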


@mp911de Is my assumption anywhere near correct that, because the 192.168.97.12 node was unavailable on startup, Lettuce removed it from its list of "potential master nodes"? Or is that a false assumption on my behalf? The Sentinel nodes were also down when the connection was made, so could it be that Lettuce never connected to the Sentinel nodes in this case (only to the Redis nodes)? That would be one plausible theory for why it never received the Pub/Sub topology update. Note, I'm only guessing; I haven't looked at the Lettuce internals here. OTOH, issuing the Sentinel reset manually did indeed make it work, so I guess it must have been connected to Sentinel at that point at least...

Could it be like this: Sentinel was down when it first tried to connect => Lettuce never got the initial pub/sub topology update/updates. Once Lettuce had managed to connect it, all was fine in terms of subsequent topology updates, but the actual update/updates when the 192.168.97.12 was made the master was gone => it never managed to recover.

Distributed systems are indeed hard... :smile:

perlun avatar Dec 28 '20 09:12 perlun

Could it be like this: Sentinel was down when it first tried to connect => Lettuce never got the initial pub/sub topology update/updates. Once Lettuce had managed to connect it, all was fine in terms of subsequent topology updates, but the actual update/updates when the 192.168.97.12 was made the master was gone => it never managed to recover.

Another theory: 192.168.97.12 could already have been the master. While it was firewalled, no other master could be elected. Once it came back, the Redis slaves would reconnect to it nicely, but no topology update was posted, since the topology didn't actually change (it was more a matter of "the master came back"; Lettuce had simply never managed to connect to this node).

perlun avatar Dec 28 '20 09:12 perlun

We are also facing the same issue with the master-slave sentinel setup. Using lettuce 5.2.2.RELEASE. Is there any update on this?

tushartvg avatar Apr 15 '21 05:04 tushartvg

We are also facing the same issue using lettuce 5.1.8.RELEASE.

Steven520xiaowei avatar Jun 17 '21 08:06 Steven520xiaowei

I am also facing the same issue, currently using version 6.1.9. We recently introduced the master/replica concept; before, we were just connecting to the master node. We are using the Spring class org.springframework.data.redis.connection.RedisStaticMasterReplicaConfiguration for configuration. Everything seemed to be fine, but we noticed that in AWS ECS, 2 out of 16 instances started throwing this error:

exception is io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterReplicaNode [redisURI=rediss://replica.prod-redis-cluster.XXXXXX.euw1.cache.amazonaws.com:6379?timeout=200000000ns, role=REPLICA]]

Also some logs here:

Caused by: io.lettuce.core.RedisException: Cannot determine topology from [rediss://master.prod-redis-cluster.XXXXXX.euw1.cache.amazonaws.com:6379?timeout=200000000ns, rediss://replica.prod-redis-cluster.XXXXXX.euw1.cache.amazonaws.com:6379?timeout=200000000ns]
	at io.lettuce.core.masterreplica.StaticMasterReplicaConnector.lambda$connectAsync$0(StaticMasterReplicaConnector.java:72)

I was not able to identify the cause. As a temporary fix I had to restart the service, and it went well after that, but after a few days the same issue arrives again. I can't restart every time for this issue.

parmar049 avatar Oct 03 '22 08:10 parmar049