go-redis Cluster IP addresses not updated in client causing permanent failure

When using ClusterClient with a Redis cluster, if the IP addresses of multiple nodes change, the client may be unable to update its internal node list and never picks up the new addresses.

This happens in dynamic environments such as Kubernetes, where Redis cluster nodes run in pods with dynamic IP addresses. The client should be able to fall back to the original hostname, which always resolves to the latest addresses, rather than always preferring the learned node IP addresses when discovering membership changes.

Similar issue: #1077

Aug 14 '19 00:08 technicianted

@vmihailenco How about storing the cluster address and always using it to refresh cluster state?

In AWS ElastiCache Redis, we use a configuration endpoint, which is a DNS A record that lists all the node IP addresses. I believe that each time the cluster state is refreshed, the client should use the initial cluster address instead of the learned nodes.

Aug 19 '19 17:08 kollektiv

Is there a way to force the DNS lookup now that the dialer is exposed (https://github.com/go-redis/redis/pull/1997)?

Jun 17 '22 04:06 ashtonian

You can call ReloadState() on the client.

Jun 18 '22 02:06 technicianted

Seeing this as well

Jan 22 '24 14:01 akshatraika-moment

Node IPs can change after maintenance. Can the library inherently support forced flushes on certain errors, or, failing that, revert to the originally stored address? Either would work.

Jan 22 '24 15:01 akshatraika-moment

I just simulated this to see whether ReloadState would help; it does not.

Here's what I ran:

	// create cluster client
	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        sliceOfFourURLsToRedisClusterNodes, // defined above, excluded from example for security.
		ReadTimeout:  time.Millisecond * 50,
		WriteTimeout: time.Millisecond * 50,
		IdleTimeout:  time.Millisecond * 50,
	})

	// Set and get a key to test that it's working.
	fmt.Println("setting key")
	cmd := client.Set("kadins-key", utils.NewUUIDString(), time.Hour)
	fmt.Println(cmd.Result())
	fmt.Println(cmd.Err())
	fmt.Println("setting done, getting it back.")
	getCmd := client.Get("kadins-key")
	fmt.Println(getCmd.Result())
	fmt.Println(getCmd.Err())

	input := ""
	fmt.Scanln(&input)
	// Here we wait for input while I modify the cluster, resulting in new hardware with new IPs. Once the modify
	// job is complete and the DNS points to the new IPs, we continue and try setting / getting again. It will error
	// because the IPs that the client has resolved are no longer accessible.

	fmt.Println("setting key")
	cmd = client.Set("kadins-key", utils.NewUUIDString(), time.Hour)
	fmt.Println(cmd.Result())
	fmt.Println(cmd.Err())
	fmt.Println("setting done, getting it back.")
	getCmd = client.Get("kadins-key")
	fmt.Println(getCmd.Result())
	fmt.Println(getCmd.Err())

	// We call ReloadState. It also fails with a networking error.
	fmt.Scanln(&input)
	fmt.Println(client.ReloadState())

	// Now we try setting / getting again. It should work, but it doesn't. How can we tell the client to do another
	// DNS lookup? Should it do it automatically after a certain number of errors? Is there a way to do it manually,
	// given that ReloadState doesn't seem to work?
	fmt.Scanln(&input)
	fmt.Println("setting key")
	cmd = client.Set("kadins-key", utils.NewUUIDString(), time.Hour)
	fmt.Println(cmd.Result())
	fmt.Println(cmd.Err())
	fmt.Println("setting done, getting it back.")
	getCmd = client.Get("kadins-key")
	fmt.Println(getCmd.Result())
	fmt.Println(getCmd.Err())

And here's the corresponding output:

setting key
OK <nil>
<nil>
setting done, getting it back.
fad49a63-e2bf-47fe-bebb-7c6dd28b22bb <nil>
<nil>

setting key
 dial tcp 172.31.40.132:6379: connect: no route to host
dial tcp 172.31.40.132:6379: connect: no route to host
setting done, getting it back.
 dial tcp 172.31.40.132:6379: connect: no route to host
dial tcp 172.31.40.132:6379: connect: no route to host

dial tcp 172.31.40.132:6379: connect: no route to host

setting key
 dial tcp 172.31.10.215:6379: i/o timeout
dial tcp 172.31.10.215:6379: i/o timeout
setting done, getting it back.
 dial tcp 172.31.40.132:6379: connect: no route to host
dial tcp 172.31.40.132:6379: connect: no route to host

Is there no way to re-lookup the addresses of the nodes without rebuilding the client entirely?
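For reference, the rebuild workaround could be sketched roughly as follows: a small holder that swaps in a freshly built client and closes the old one. Everything here is hypothetical (`reloadable`, `newReloadable`, `fakeClient`); it only assumes the real client has a `Close() error` method, and it does not address commands that are still in flight on the old client when it is closed.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// reloadable holds a client and can replace it with a freshly built one.
// Hypothetical type for illustration; C is any type with Close() error.
type reloadable[C io.Closer] struct {
	mu    sync.RWMutex
	cur   C
	build func() (C, error)
}

// newReloadable builds the initial client with the given factory.
func newReloadable[C io.Closer](build func() (C, error)) (*reloadable[C], error) {
	c, err := build()
	if err != nil {
		return nil, err
	}
	return &reloadable[C]{cur: c, build: build}, nil
}

// Get returns the client currently in use.
func (r *reloadable[C]) Get() C {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cur
}

// Rebuild creates a new client (which resolves its addresses from scratch),
// swaps it in, and closes the previous one. If the factory fails, the old
// client stays in place.
func (r *reloadable[C]) Rebuild() error {
	next, err := r.build()
	if err != nil {
		return err
	}
	r.mu.Lock()
	old := r.cur
	r.cur = next
	r.mu.Unlock()
	return old.Close()
}

// fakeClient stands in for a real cluster client in this example.
type fakeClient struct {
	id     int
	closed bool
}

func (f *fakeClient) Close() error { f.closed = true; return nil }

func main() {
	n := 0
	r, err := newReloadable(func() (*fakeClient, error) {
		n++
		return &fakeClient{id: n}, nil
	})
	if err != nil {
		panic(err)
	}
	first := r.Get()
	if err := r.Rebuild(); err != nil {
		panic(err)
	}
	fmt.Println(first.id, first.closed, r.Get().id) // 1 true 2
}
```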

Apr 26 '24 14:04 mkadin