Strange prod issue: Java Lettuce Redis client intermittently tries to connect to an unexpected host IP

Open anandmehta-fareye opened this issue 11 months ago • 7 comments

We connect to AWS ElastiCache for Redis using the Lettuce Java client, but it intermittently throws the error below. Surprisingly, we are not passing this IP anywhere in the code.

Below is the code that creates the connection factory bean:

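import io.lettuce.core.ReadFrom;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class RedisConfig {

    // Hostname of the Redis endpoint, taken from application properties.
    @Value("${spring.redis.host}")
    private String redisService;

    @Value("${spring.redis.port}")
    private Integer redisPort;

    /**
     * Creates the Lettuce connection factory for a standalone Redis endpoint,
     * preferring replicas for reads.
     *
     * @return the Lettuce connection factory
     */
    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
                .readFrom(ReadFrom.REPLICA_PREFERRED)
                .build();
        RedisStandaloneConfiguration serverConfig = new RedisStandaloneConfiguration(redisService, redisPort);
        return new LettuceConnectionFactory(serverConfig, clientConfig);
    }
}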

A single request works fine; I get the error below whenever I send multiple concurrent requests.

[ioEventLoop-6-1] io.lettuce.core.AbstractRedisClient      : Connecting to Redis at 10.X.X.193/<unresolved>:6379: 10.X.X.193/<unresolved>:6379

io.netty.channel.ConnectTimeoutException: **connection timed out: /10.X.X.193:6379**
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[netty-transport-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
2025-02-03 17:40:44.182	
2025-02-03T12:10:44.182Z DEBUG 6 --- [ioEventLoop-6-2] io.lettuce.core.AbstractRedisClient      : Connecting to Redis at **REDIS HOST**/<unresolved>:6379: **Success**
2025-02-03 17:40:44.178	
2025-02-03T12:10:44.178Z DEBUG 6 --- [ioEventLoop-6-1] io.lettuce.core.AbstractRedisClient      : Connecting to Redis at 10.X.X.193/<unresolved>:6379

This is a production issue; any help would be highly appreciated.

anandmehta-fareye avatar Feb 04 '25 07:02 anandmehta-fareye

Hey @anandmehta-fareye ,

Have you checked the AWS ElastiCache Best Practices guide, specifically the section on Java DNS cache TTL? Did you configure the JVM accordingly?
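For readers unfamiliar with that setting, here is a minimal sketch of capping the JVM's DNS cache, assuming the JVM-wide `networkaddress.cache.ttl` security property is the knob in question. The 60-second value and the `DnsCacheTtl` class name are illustrative assumptions, not values taken from this thread; check the AWS guide for the recommended figure.

```java
import java.security.Security;

public class DnsCacheTtl {

    /** Caps how long the JVM caches successful DNS lookups, in seconds. */
    public static void configure(int ttlSeconds) {
        // Best called early in startup, before the first hostname lookup,
        // because the JVM typically reads this property only once.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
    }

    public static void main(String[] args) {
        configure(60); // assumed value, for illustration only
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same property can also be set in the JVM's `java.security` file, which avoids depending on call order inside the application.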

tishun avatar Feb 04 '25 08:02 tishun

@tishun I have checked the guide and also reached out to AWS regarding this. I learned that sometimes, when there is a parallel connection, the Lettuce Redis client tries to connect using the IP of the ENI (Elastic Network Interface), which causes a connection timeout. However, most of the time it connects to the specified ElastiCache host, allowing us to establish a successful connection.

It has been escalated to the AWS team, but we still haven't found a solution. Do you have any insights on this?

anandmehta-fareye avatar Feb 07 '25 07:02 anandmehta-fareye

> @tishun I have checked the guide and also reached out to AWS regarding this. I learned that sometimes, when there is a parallel connection

Do you know what "parallel" means in this case?

> , the Lettuce Redis client tries to connect using the IP of the ENI (Elastic Network Interface), which causes a connection timeout. However, most of the time it connects to the specified ElastiCache host, allowing us to establish a successful connection.

> It has been escalated to the AWS team, but we still haven't found a solution. Do you have any insights on this?

I am afraid not. This seems like a very specific issue related to AWS infrastructure; I think the AWS team is better placed to help here.

If anything can be done on the driver side, please let me know.

tishun avatar Feb 07 '25 13:02 tishun

Interestingly we are experiencing a somewhat similar issue.

When a failover happens in one of our Redis clusters and a new node becomes master and starts taking traffic, clients attempt to connect, for some requests (not all), to a bad IP address that doesn't exist in the cluster.

I don't have evidence yet that Lettuce is responsible, but I don't know where else it could be coming from.

RohanNagar avatar May 19 '25 18:05 RohanNagar

At this point I can only speculate, but:

  • the driver does not build IP addresses itself, to the best of my knowledge, so it must have received it from somewhere
  • that "somewhere" is either its configuration or the server it talks to
  • the issue described by @anandmehta-fareye shows private IP addresses, and the description of the bug mentioned that these IP addresses are never configured by the client application

With all this in mind, I would assume that the wrong IP comes from the server (e.g. a MOVED response or some other indication to the driver that the address has changed). I can also imagine that this is an intermittent issue where the server configuration has a private IP address set for a given shard before some other task reconfigures the server with the public one.

But this is all just speculation.
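One way to test the first half of that reasoning is to check whether the client's own DNS resolution ever yields the unexpected address. The sketch below is only an illustration: the `ResolveCheck` class and its method names are hypothetical, and the host and IP are placeholders to be replaced with the configured Redis hostname and the IP seen in the logs. If the suspect IP never appears in the DNS answer, that is consistent with the address having been handed to the driver by the server.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ResolveCheck {

    /** Returns every IP address the local resolver currently maps the host to. */
    public static List<String> resolve(String host) {
        try {
            return Arrays.stream(InetAddress.getAllByName(host))
                    .map(InetAddress::getHostAddress)
                    .collect(Collectors.toList());
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the suspect IP is one of the addresses DNS returns for the host. */
    public static boolean comesFromDns(String host, String suspectIp) {
        return resolve(host).contains(suspectIp);
    }

    public static void main(String[] args) {
        // Placeholders: pass the real Redis hostname and the IP from the logs.
        String host = args.length > 0 ? args[0] : "localhost";
        String suspectIp = args.length > 1 ? args[1] : "127.0.0.1";
        System.out.println(host + " resolves to " + resolve(host));
        System.out.println("suspect IP found in DNS answer: " + comesFromDns(host, suspectIp));
    }
}
```

Because DNS answers can change over time, a single run proves little; running it repeatedly from the affected host around the time of the errors gives a more useful signal.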

tishun avatar May 21 '25 07:05 tishun

Hey @tishun thanks for your response!

We were actually able to narrow down the issue and rule out Lettuce. The wrong IP is indeed coming from the misbehaving node: CLUSTER NODES run against that node returns the wrong IP, whereas CLUSTER NODES run against every other node in the cluster returns the correct IP for that bad node.
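For anyone wanting to run the same comparison, here is a rough sketch of diffing two CLUSTER NODES outputs. It assumes the raw output has already been collected from each node (for example with `redis-cli -h <node> cluster nodes`) and that each line starts with the node id followed by an `ip:port@cport[,hostname]` field. The `ClusterNodesDiff` class, the node ids, and the IP addresses are all made up for illustration; this is not the tooling used in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ClusterNodesDiff {

    /** Maps node id -> "ip:port" from raw CLUSTER NODES output. */
    public static Map<String, String> parseAddresses(String clusterNodesOutput) {
        Map<String, String> addresses = new HashMap<>();
        for (String line : clusterNodesOutput.split("\n")) {
            String[] fields = line.trim().split(" ");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            // Second field looks like ip:port@cport[,hostname]; keep only ip:port.
            addresses.put(fields[0], fields[1].split("[@,]")[0]);
        }
        return addresses;
    }

    /** Node ids present in both views whose address differs between them. */
    public static Map<String, String> disagreements(Map<String, String> viewA,
                                                    Map<String, String> viewB) {
        Map<String, String> diff = new TreeMap<>();
        for (Map.Entry<String, String> entry : viewA.entrySet()) {
            String other = viewB.get(entry.getKey());
            if (other != null && !other.equals(entry.getValue())) {
                diff.put(entry.getKey(), entry.getValue() + " vs " + other);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: node-a reports a different IP for itself
        // than node-b reports for it.
        String viewFromNodeA =
                "node-a 10.0.0.99:6379@16379 myself,master - 0 0 1 connected 0-8191\n"
              + "node-b 10.0.0.12:6379@16379 master - 0 1700000000000 2 connected 8192-16383";
        String viewFromNodeB =
                "node-a 10.0.0.11:6379@16379 master - 0 1700000000000 1 connected 0-8191\n"
              + "node-b 10.0.0.12:6379@16379 myself,master - 0 0 2 connected 8192-16383";

        System.out.println(disagreements(parseAddresses(viewFromNodeA),
                                         parseAddresses(viewFromNodeB)));
    }
}
```

Any node id that shows up in the result is one whose address the two nodes disagree on, which matches the symptom described above.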

My theory is that we have some latent DNS issue on that bad host, such that when the host starts up and adds itself to the cluster via its hostname, it resolves to the incorrect IP locally (while other nodes have working DNS and see the correct IP).

We're planning to switch from registering nodes via hostname to registering via IP and see if that helps.

RohanNagar avatar May 22 '25 02:05 RohanNagar

Awesome! Thanks for coming back, the community would benefit from that knowledge!

tishun avatar May 22 '25 09:05 tishun

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 2 weeks this issue will be closed.

github-actions[bot] avatar Nov 27 '25 00:11 github-actions[bot]