The driver does not deal with changing redis IP addresses in cluster mode

Open akshaymankar opened this issue 2 years ago • 20 comments

Hedis loses its connection to a Redis cluster if the cluster nodes change their IP addresses during a restart.

How to reproduce the problem

  1. Start a Haskell app which talks to Redis in cluster mode
  2. Restart each Redis node in a way that the IP address of each node changes. This happens when using the Bitnami redis-cluster Helm chart.
  3. Observe that the app can no longer talk to Redis

While the app cannot talk to redis, we get these kinds of error messages:

service: Network.Socket.connect: <socket: 51>: does not exist (No route to host)

Impact

This makes using hedis in Kubernetes very difficult.

Speculation

This could be a result of using socketToHandle here. The docs for the function say:

Caveat Handle is not recommended for network programming in Haskell, e.g. merely performing hClose on a TCP socket won't cooperate with peer's gracefulClose, i.e. proper shutdown sequence with appropriate handshakes specified by the protocol.

It could also be some other problem, this is just speculation.

Additional Info

I create ConnectInfo like this:

Redis.defaultConnectInfo
  { Redis.connectHost = "<kubernetes-service-name>",
    Redis.connectPort = 6379,
    Redis.connectTimeout = Just (secondsToNominalDiffTime 5),
    Redis.connectMaxConnections = 100
  }

And connect to redis like this:

Redis.connectCluster connInfo

This connection does not use TLS.

akshaymankar avatar May 30 '22 15:05 akshaymankar

Please see https://github.com/informatikr/hedis/issues/184

k-bx avatar May 30 '22 16:05 k-bx

@michaelglass, https://github.com/informatikr/hedis/issues/168 suggests you're using cluster support quite extensively. Is this something you could take a look into?

flokli avatar May 30 '22 16:05 flokli

We need to run refreshShardMap, which would fetch the new cluster info (by issuing a cluster-info call to Redis). But in your case, how would this query even run? You would need the static IP of at least one node to run the query and obtain the info needed to connect to all the nodes at their new IPs.

PS: the connections currently held in the pool might still be used, and are removed only after exceptions are thrown.

aravindgopall avatar Jan 06 '23 07:01 aravindgopall
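The refresh-on-failure idea described above can be sketched in base-only Haskell. All names below (`withShardMapRefresh`, the seed resolver) are invented for illustration; this is not the hedis API:

```haskell
import Control.Exception (SomeException (..), catch)
import Data.IORef

-- Run an action against the cached shard map; on any failure, re-resolve
-- the seed addresses (e.g. from a static IP or stable DNS name), replace
-- the cached map, and retry once with the fresh addresses.
withShardMapRefresh
  :: IO [String]          -- ^ re-resolve seed node addresses
  -> IORef [String]       -- ^ cached shard map (node addresses)
  -> ([String] -> IO a)   -- ^ action using the current shard map
  -> IO a
withShardMapRefresh resolveSeeds cacheRef action = do
  nodes <- readIORef cacheRef
  action nodes `catch` \(SomeException _) -> do
    fresh <- resolveSeeds       -- ask the stable seed for the new topology
    writeIORef cacheRef fresh   -- drop the stale addresses
    action fresh                -- retry with the fresh shard map
```

As the comment above notes, this only helps if the seed resolver itself reaches a live address, which is why a stable name (rather than a captured IP) matters.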

Sorry about the late response. Maybe @stoeffel or @omnibs might have more context. I don't write much Haskell anymore 😔

michaelglass avatar Jan 06 '23 09:01 michaelglass

For anyone else looking here, we have a fork at Juspay (https://github.com/juspay/hedis) with some of these fixes already done. We would like to upstream them at some point.

aravindgopall avatar Jan 13 '23 06:01 aravindgopall

@aravindgopall Glad to see you forked. Is it the fix/connection branch people are supposed to use? Why is it 104 commits behind? Would you be interested in taking over hedis maintainership?

ysangkok avatar Jan 13 '23 14:01 ysangkok

Right now it is the fix/connection branch; we are testing the changes in our setup. Once done, we will merge everything to the cluster branch, followed by master.

Why is it 104 commits behind? The cluster branch has all of the changes from before they were merged into upstream hedis (done by @alexjg); there is a PR waiting to bring it up to date with upstream. This all might take a week or two to settle.

@ysangkok I would be very much interested in this.

aravindgopall avatar Jan 14 '23 15:01 aravindgopall

JFI: the fix/connection branch is up to date with upstream master and being tested. Once done, we will merge to master and release.

aravindgopall avatar Jan 18 '23 07:01 aravindgopall

JFI: the cluster branch has the latest code with the above fix, plus some other fixes, in the Juspay fork.

@ysangkok, how do you want me to continue: shall I create a PR to upstream, or should we continue on the Juspay fork?

aravindgopall avatar Jan 23 '23 15:01 aravindgopall

JFI: the cluster branch has the latest code with the above fix, plus some other fixes, in the Juspay fork.

@ysangkok, how do you want me to continue: shall I create a PR to upstream, or should we continue on the Juspay fork?

I would suggest creating a PR to this repo and we can get it merged.

jbrechtel avatar Jan 25 '23 14:01 jbrechtel

Right now it is the fix/connection branch; we are testing the changes in our setup. Once done, we will merge everything to the cluster branch, followed by master.

Why is it 104 commits behind? The cluster branch has all of the changes from before they were merged into upstream hedis (done by @alexjg); there is a PR waiting to bring it up to date with upstream. This all might take a week or two to settle.

@ysangkok I would be very much interested in this.

Hi @aravindgopall - I mistakenly offered to maintain this library some months ago and it turns out that I simply don't have the time. Could you email me and we can discuss handing over maintainership? james at flipstone . com

jbrechtel avatar Jan 25 '23 14:01 jbrechtel

We need to run refreshShardMap, which would fetch the new cluster info (by issuing a cluster-info call to Redis). But in your case, how would this query even run? You would need the static IP of at least one node to run the query and obtain the info needed to connect to all the nodes at their new IPs.

PS: the connections currently held in the pool might still be used, and are removed only after exceptions are thrown.

This issue won't be solved by refreshing the shard map from IP addresses already in the map: after Helm updates the Redis cluster, all Redis nodes have new IP addresses assigned. That is, no IP address stored in the available shard map would connect to a then-running Redis node. The only way to reconnect the client is to resolve a new valid IP address via the Redis cluster's DNS name and then obtain an entirely new shard map.

For our use-case, we retain the DNS name outside Hedis. This is suboptimal, since the DNS name is already passed to Hedis for establishing the initial connection and could be retained for failovers when all IPs have become invalid, for instance due to Helm being a villain.

stefanwire avatar Jan 31 '23 10:01 stefanwire
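The fallback described above (keep the DNS name and return to it when every cached IP is dead) can be sketched as follows; `connectAny` and its arguments are hypothetical helpers, not hedis functions:

```haskell
import Control.Exception (SomeException (..), catch)

-- Try each cached node address in turn; if none connects, re-resolve the
-- cluster's stable DNS name once and try the fresh addresses.
connectAny
  :: (String -> IO conn)  -- ^ connect to one address, throwing on failure
  -> IO [String]          -- ^ re-resolve the cluster's DNS name
  -> [String]             -- ^ cached (possibly stale) node addresses
  -> IO conn
connectAny connect resolveDns cached = go cached True
  where
    go [] True = do
      fresh <- resolveDns       -- every cached IP failed: ask DNS again
      go fresh False            -- allow one DNS fallback, then give up
    go [] False = ioError (userError "no reachable node")
    go (addr : rest) fallbackLeft =
      connect addr `catch` \(SomeException _) -> go rest fallbackLeft
```

If the client retained the original connectHost internally, it could perform this fallback itself instead of requiring a wrapper around Hedis.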

@stefanwire This is a valid scenario. Right now it resolves the issue only when at least one of the Redis nodes kept its IP address. That is generally helpful in more frequent scenarios like failovers, scaling, etc.

aravindgopall avatar Jan 31 '23 11:01 aravindgopall

Feel free to check and adapt the code linked in my previous comment to solve the issue where none of the IPs connect anymore. Also, check my open PR, which adds the CLUSTER INFO command in order to improve robustness for Redis in cluster mode. We would be happy to use Hedis without local wrappers once this is done and merged.

stefanwire avatar Jan 31 '23 12:01 stefanwire

I did a quick check on Juspay's branch with the following test scenario:

  • AWS Elasticache Redis cluster, 1 shard, 2 nodes
  • Launch program, issue some redis commands
  • Trigger "failover primary" on the EC UI
  • Try to issue any additional redis commands

It doesn't seem to work:

  • Without Juspay's branch: timeout executing redis commands, forever
  • With Juspay's branch: NoNodeException, forever

This means the shardMap MVar introduced in Juspay's branch somehow becomes empty during failover, and there's no recovery mechanism for when that happens.

omnibs avatar Mar 11 '24 23:03 omnibs

Hi @omnibs, can you give the Juspay master branch a try and share the results? In our internal tests it recovers without a NoNodeException. Please let us know if you need any more info.

aravindgopall avatar Mar 12 '24 04:03 aravindgopall

With the master branch it does recover from NoNodeException. Thank you for the pointer!

It seemed to take several minutes, though (can't re-test bc I ran out of test failovers for today).

I don't follow in the code what's making it recover, but I'm curious if there's something we can tweak to make it recover faster.

omnibs avatar Mar 12 '24 06:03 omnibs

It's a retry mechanism based on MOVED/timeout (this can be configured to recover faster). When it occurs, it will get the new shardMap info by picking a random node (again, there is a chance that node is down too, which can add to the recovery time). After this, it will send the command to the new node which is responsible.

It would be great to see some numbers and graphs to understand this (e.g. resharding time, throughput, how many errors, etc.).

aravindgopall avatar Mar 12 '24 07:03 aravindgopall
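The MOVED/timeout-driven retry described above can be sketched as a bounded retry with backoff that refreshes cluster state between attempts. The names and the backoff policy here are illustrative assumptions, not what the Juspay branch actually implements:

```haskell
import Control.Concurrent
import Control.Exception (SomeException (..), catch)

-- Retry a Redis command a bounded number of times, doubling the delay and
-- refreshing the shard map before each retry.
retryWithRefresh
  :: Int    -- ^ remaining attempts
  -> Int    -- ^ initial delay in microseconds
  -> IO ()  -- ^ refresh the shard map (e.g. query a random known node)
  -> IO a   -- ^ the Redis command
  -> IO a
retryWithRefresh attempts delayUs refresh cmd
  | attempts <= 0 = cmd
  | otherwise =
      cmd `catch` \(SomeException _) -> do
        threadDelay delayUs       -- back off before touching the cluster
        refresh                   -- pick up the new topology
        retryWithRefresh (attempts - 1) (delayUs * 2) refresh cmd
```

The attempt count and initial delay are exactly the kind of knobs that could make recovery take seconds rather than minutes.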

Thank you for explaining!

So I re-ran my test in a simple manner:

  • I left curl on a loop, poking the health-check endpoint of our app in an environment every 2s
    • The health-check endpoint runs a PING through hedis, nothing more.
  • I triggered a failover
  • I waited for things to normalize

It took a pretty long time:

  • 15:12:24 - good
  • 15:12:26 - Connection Lost (failover happened)
  • 15:12:29 - NoNodeException
  • ...
  • 15:21:04 - good (recovery)

So it took 8min30s to recover from NoNodeException.

The failover itself is quick. I didn't time it, but restarting another instance of the app brought it back to life immediately, while the instance I was curling kept at NoNodeException for several minutes after that.

It's a retry mechanism based on MOVED/timeout (this can be configured to recover faster). When it occurs, it will get the new shardMap info by picking a random node

Do my results make sense with this mechanism?

It sounds like after the failover my shardMap was suddenly empty, and stayed empty for a long time, so it couldn't have tried picking a random node. But if this is correct, I don't understand what makes it recover after the 8min30s.

I plan to debug the mechanism further to figure out what's up, but any pointers you might have on what's happening would be appreciated!

ps: just to clarify, there's no resharding time, the failover works over a single shard

omnibs avatar Mar 13 '24 22:03 omnibs

Ok, my test was too simplistic, and for Redis Cluster's intended usage, perhaps unrealistic.

Redis outside Elasticache requires at least 3 primaries for cluster mode. Elasticache allows you to set it up with a single primary.

I tested a local redis cluster with 3 primaries, ran a failover with debug segfault and with manual failover, and I didn't even notice the failover. No errors.

I tested failing over an Elasticache cluster with 2 primaries, and all went well. We got a single NoNodeException, which turned into a timeout after 500ms, and fully recovered after 12s, which is very reasonable.

@aravindgopall one last question then, does Juspay plan to upstream its fixes to cluster mode? I noticed there were 2 PRs for upstreaming them which got closed (#192 and #191).

omnibs avatar Mar 14 '24 00:03 omnibs