Strange prod issue: the Lettuce Java Redis client intermittently tries to connect to an unexpected host IP
We connect to AWS ElastiCache for Redis using the Lettuce Java client, but it intermittently throws the error below. Surprisingly, we are not passing this IP anywhere in the code.
Below is the code that creates the connection factory bean:
import io.lettuce.core.ReadFrom;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class RedisConfig {

    @Value("${spring.redis.host}")
    private String redisService;

    @Value("${spring.redis.port}")
    private Integer redisPort;

    /**
     * Creates the Lettuce-based Redis connection factory.
     *
     * @return the lettuce connection factory
     */
    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        // Prefer reading from replicas; writes still go to the master.
        LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
                .readFrom(ReadFrom.REPLICA_PREFERRED)
                .build();
        RedisStandaloneConfiguration serverConfig = new RedisStandaloneConfiguration(redisService, redisPort);
        return new LettuceConnectionFactory(serverConfig, clientConfig);
    }
}
If I send a single request, it works fine; I get the error below whenever I try to send multiple concurrent requests.
[ioEventLoop-6-1] io.lettuce.core.AbstractRedisClient : Connecting to Redis at 10.X.X.193/<unresolved>:6379: 10.X.X.193/<unresolved>:6379
io.netty.channel.ConnectTimeoutException: **connection timed out: /10.X.X.193:6379**
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[netty-transport-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.100.Final.jar!/:4.1.100.Final]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
2025-02-03T12:10:44.182Z DEBUG 6 --- [ioEventLoop-6-2] io.lettuce.core.AbstractRedisClient : Connecting to Redis at **REDIS HOST**/<unresolved>:6379: **Success**
2025-02-03T12:10:44.178Z DEBUG 6 --- [ioEventLoop-6-1] io.lettuce.core.AbstractRedisClient : Connecting to Redis at 10.X.X.193/<unresolved>:6379
It's a prod issue; any help would be highly appreciated.
Hey @anandmehta-fareye,
Have you checked the AWS ElastiCache best practices guide, specifically the section on the Java DNS cache TTL? Did you configure the JVM accordingly?
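For reference, the JVM side of that guidance can look roughly like this; the TTL values are illustrative, and the same properties can also be set in the JVM's java.security file:

import java.security.Security;

public final class DnsCacheConfig {

    // Must run before the JVM performs its first name lookup
    // (e.g. from a static initializer or at the very start of main).
    public static void applyShortDnsTtl() {
        // Keep successful lookups cached only briefly so that ElastiCache endpoint
        // IP changes (node replacement, failover) are picked up quickly.
        Security.setProperty("networkaddress.cache.ttl", "5");
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
    }
}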
@tishun I have checked the guide and also reached out to AWS regarding this. I learned that sometimes, when there is a parallel connection, the Lettuce Redis client tries to connect using the IP of the ENI (Elastic Network Interface), which causes a connection timeout. However, most of the time it connects to the specified ElastiCache host, allowing us to establish a successful connection.
It's been escalated to the AWS team, but we still haven't found a solution. Do you have any insights on this?
> I learned that sometimes, when there is a parallel connection, the Lettuce Redis client tries to connect using the IP of the ENI

Do you know what "parallel" means in this case?

> Do you have any insights on this?
I'm afraid not. This seems like a very specific issue related to AWS infrastructure; I think the AWS team would be better placed to help here.
If anything can be done from the driver's side, please let me know.
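That said, one knob that does exist on the driver side, whatever the root cause turns out to be, is the socket connect timeout, so an attempt against an unreachable IP fails fast instead of waiting for the default. A rough sketch as a drop-in for the clientConfig above; the timeout values are illustrative:

import io.lettuce.core.ClientOptions;
import io.lettuce.core.ReadFrom;
import io.lettuce.core.SocketOptions;
import java.time.Duration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;

final class LettuceTimeouts {

    // Could replace the clientConfig built in redisConnectionFactory() above.
    static LettuceClientConfiguration clientConfigWithTimeouts() {
        // Cap how long a single connect attempt may wait on an unreachable address.
        SocketOptions socketOptions = SocketOptions.builder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        return LettuceClientConfiguration.builder()
                .clientOptions(ClientOptions.builder().socketOptions(socketOptions).build())
                .commandTimeout(Duration.ofSeconds(1))
                .readFrom(ReadFrom.REPLICA_PREFERRED)
                .build();
    }
}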
Interestingly, we are experiencing a somewhat similar issue.
When a failover happens and a new Redis node becomes a master and starts taking traffic, clients of our Redis clusters attempt to connect, for some requests (not all), to a bad IP address that doesn't exist in the cluster.
I don't really have evidence yet that it's Lettuce, but I don't really know where else it could be coming from.
At this point I can only speculate, but:
- the driver does not build IP addresses itself, to the best of my knowledge, so it must have received the address from somewhere
- that "somewhere" is either its configuration or the server it talks to
- the issue described by @anandmehta-fareye shows private IP addresses, and the description of the bug mentions that these IP addresses are never configured by the client application
With all of this in mind, I would assume that the wrong IP comes from the server (e.g. a MOVED response or some other indication to the driver that the address has changed). I can also imagine this being an intermittent issue where the server configuration has a private IP address set for a given shard before some other task reconfigures the server with the public one.
But this is all just speculation.
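For context on where such an address change would reach the driver in a cluster setup: the topology is learned from CLUSTER NODES and from redirects, and refreshing it is opt-in. A sketch of how that refresh is typically enabled (the interval is illustrative, and this applies to cluster client configurations only):

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import java.time.Duration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;

final class TopologyRefreshConfig {

    static LettuceClientConfiguration clusterClientConfig() {
        // Re-fetch the cluster topology periodically and whenever the driver sees
        // redirects or persistent reconnect failures.
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        return LettuceClientConfiguration.builder()
                .clientOptions(ClusterClientOptions.builder()
                        .topologyRefreshOptions(refresh)
                        .build())
                .build();
    }
}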
Hey @tishun, thanks for your response!
We were actually able to narrow down the issue and rule out Lettuce. The wrong IP is indeed coming from the node that is messed up: CLUSTER NODES run on that node returns the wrong IP, whereas CLUSTER NODES run against every other node in the cluster returns the correct IP for that bad node.
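In case it helps anyone else hitting this, the check was simply comparing each node's own CLUSTER NODES output side by side; a rough Lettuce sketch, with placeholder endpoints:

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;
import java.util.List;

public class ClusterNodesCheck {
    public static void main(String[] args) {
        // Placeholder endpoints; replace with the actual cluster members.
        List<String> nodes = List.of("node-a.example.internal", "node-b.example.internal", "node-c.example.internal");

        for (String host : nodes) {
            RedisClient client = RedisClient.create(RedisURI.create(host, 6379));
            try (StatefulRedisConnection<String, String> conn = client.connect()) {
                // Each node reports its own view of the cluster, including the IPs it advertises for itself.
                System.out.println("=== CLUSTER NODES as seen by " + host + " ===");
                System.out.println(conn.sync().clusterNodes());
            } finally {
                client.shutdown();
            }
        }
    }
}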
My theory is that we have some latent DNS issue on that bad host, such that when the host starts up and adds itself to the cluster via its hostname, it resolves to the incorrect IP locally (but other nodes have working DNS and see the correct IP).
We're planning to switch from registering nodes via hostname to registering via IP and see if that helps.
Awesome! Thanks for coming back; the community will benefit from that knowledge!
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 2 weeks this issue will be closed.