
[BUG] In a big cluster, Redis (Valkey) replica node is not able to sync since 6.2.11

Open hwware opened this issue 1 year ago • 14 comments

Describe the bug

Reference: Redis issue https://github.com/redis/redis/issues/12001. Since Redis PR https://github.com/redis/redis/pull/11785 is involved, once a single node holds more than 7 GB of data, the replica node cannot sync with the primary node.

To reproduce

Reference from Redis issue https://github.com/redis/redis/issues/12001 and link https://github.com/redis/redis/issues/12001#issuecomment-1743066121

  1. create cluster with 6 nodes (3 masters, 3 replicas)
  2. fill each node with 7gb or more data
  3. restart replica
  4. with high probability the replica won't be able to sync, failing with an error like == CRITICAL == This replica is sending an error to its master: 'Protocol error: too big inline request' after processing the command ''; sometimes it finishes successfully (~10% of cases)

Expected behavior

The replica node should be able to sync with the primary node.


hwware · Jul 24 '24

Is it starvation of the cluster bus?

Maybe something like dual-channel sync can help? The RDB is produced in a fork process, so the main process can still talk to the cluster bus...

zuiderkwast · Jul 26 '24

The referenced issue also has an open PR that attempts to fix the problem: https://github.com/redis/redis/pull/13308

Basically, the redis inbound cluster connection's TCP keepalive idle time is set to 2 * cluster node timeout, which people typically set to quite aggressive values (seconds), and the TCP keepalive probe interval is currently set to 1/3 of the idle time.

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L1412

So the PR makes the TCP keepalive settings of the redis inbound cluster connection configurable through the existing config variable server.tcpkeepalive (same as other redis server connections).
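
I haven't reproduced the actual diff here, but a minimal sketch of the idea (assuming the `connKeepAlive()` call at the cluster_legacy.c line linked above is the one to change, and that `server.tcpkeepalive` is reused as-is) could look like this:

```c
/* Hypothetical sketch, not the redis/redis#13308 diff. In the cluster bus
 * accept path, stop deriving the keepalive idle time from the (often very
 * aggressive) cluster node timeout and reuse the server-wide tcp-keepalive
 * config instead. */

/* Before: idle time tied to the cluster node timeout (ms converted to s). */
connKeepAlive(conn, server.cluster_node_timeout / 1000 * 2);

/* After: same keepalive policy as ordinary client connections
 * (tcp-keepalive is in seconds, 0 means disabled). */
if (server.tcpkeepalive) connKeepAlive(conn, server.tcpkeepalive);
```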

shanipribadi · Jul 27 '24

Any update on this? We are still facing this issue. Any idea when, or in which version, this will be fixed?

sureshvasanthkumar · Mar 26 '25

> The referenced issue also has an open PR that attempts to fix the problem: redis/redis#13308

Happy to license that change under any license you need; if Valkey doesn't have such a fix yet, it really needs it.

xnox · Aug 13 '25

Thanks @xnox! However, we don't even look at redis changes because of the risk of legal trouble.

Without looking at the redis PR, if it is as trivial as changing the tcp keepalive for cluster connections to a different value, we can of course do it.

If it involves more changes, the best from the legal point of view is that you contribute a PR yourself. When you sign off the commit (using -s option to git commit) you indicate that you have the right to contribute the change.

zuiderkwast · Aug 14 '25

Ack, let me study valkey codebase to see if it complies with relevant TCP RFCs w.r.t. keepalives.

xnox · Aug 14 '25

Currently, we set it as `connKeepAlive(conn, server.cluster_node_timeout / 1000 * 2);`.

`connKeepAlive()` calls `anetKeepAlive()`. In `anetKeepAlive()`, we can see that we set the probing interval `TCP_KEEPINTVL` to 1/3 of the `TCP_KEEPALIVE` value, so if cluster node timeout is 15000, `TCP_KEEPALIVE` is 30 seconds and the kernel will probe the connection after 10 seconds. We want to allow it to be unresponsive for the time of `server.cluster_node_timeout` without triggering a failover, right?
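
For reference, roughly what that ends up doing on Linux (a simplified standalone sketch, not the exact anet.c code; the 1/3 probe interval and 3 probes mirror what's described above):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Simplified sketch of the keepalive setup applied to a socket on Linux,
 * with `interval` being the idle time in seconds (here 2 * node timeout). */
static int keepalive_sketch(int fd, int interval) {
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes)) == -1) return -1;

    /* Start probing after `interval` seconds of silence. */
    int idle = interval;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1) return -1;

    /* Then probe every interval/3 seconds (at least 1 second)... */
    int intvl = interval / 3;
    if (intvl == 0) intvl = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1) return -1;

    /* ...and drop the connection after 3 unanswered probes. */
    int cnt = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1) return -1;

    return 0;
}
```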

I'm not sure if I understand this mechanism correctly. It's better if the replication full sync would not block the cluster bus. It should take breaks to let the cluster communication work. I don't know why it is blocking for such a long time. Is it synchronous disk I/O?

zuiderkwast · Aug 14 '25

@hpatro FYI (You're always interested in cluster issues.)

zuiderkwast · Aug 14 '25

As I understand it, on Linux, if cluster node timeout is 15000, TCP_KEEPIDLE is set to 30 seconds. So after 30 sec of idling/silence the kernel will attempt to send 3 keep-alive probes (TCP_KEEPCNT) with a 10 sec (TCP_KEEPINTVL) interval, so the connection would be closed after 60 sec. We allow it to be unresponsive for 4 times the cluster node timeout then?

TCP_KEEPALIVE seems to be a macOS-only sockopt, so the timeout might depend on what platform you've seen this issue on.
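
Putting those numbers together (a back-of-the-envelope sketch, assuming the Linux semantics described above and an example 15000 ms cluster-node-timeout):

```c
#include <stdio.h>

int main(void) {
    int cluster_node_timeout_ms = 15000;                  /* assumed example value */
    int keepidle_s = cluster_node_timeout_ms / 1000 * 2;  /* 30 s before the first probe */
    int keepintvl_s = keepidle_s / 3;                      /* 10 s between probes */
    int keepcnt = 3;                                       /* unanswered probes before closing */

    /* 30 + 3 * 10 = 60 s, i.e. 4x the cluster node timeout. */
    int detect_s = keepidle_s + keepcnt * keepintvl_s;
    printf("dead peer detected after ~%d s (%.1fx node timeout)\n",
           detect_s, detect_s * 1000.0 / cluster_node_timeout_ms);
    return 0;
}
```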

bjosv · Aug 14 '25

> @hpatro FYI (You're always interested in cluster issues.)

Yeah, thanks for the tag, this wasn't on my radar. Cluster stability matters a lot to us. I'm going to have to reproduce this; no one mentioned it to me internally, and there are plenty of full sync operations happening all the time.

High-level questions (I don't want us to look into the Redis repository):

  • Any idea why 6.2.11 in particular?
  • Has anyone else observed it?

hpatro · Aug 18 '25

@hpatro If you don't want to look into the redis repository, I believe you can use Valkey's 6.2 branch. (We've deleted the old tags but we kept the old release branches.)

zuiderkwast · Aug 18 '25

> @hpatro If you don't want to look into the redis repository, I believe you can use Valkey's 6.2 branch. (We've deleted the old tags but we kept the old release branches.)

We haven't fixed anything regarding this issue (if it exists), so I should be able to reproduce it with unstable, I believe. @zuiderkwast In your environment(s) you have a much shorter timeout; do you observe this issue?

hpatro · Aug 18 '25

> In your environment(s) you have a much shorter timeout; do you observe this issue?

Not that I'm aware of. Maybe we don't have "one single node data more than 7GB".

zuiderkwast · Aug 18 '25