[BUG] In a big cluster, a Redis (Valkey) replica node is not able to sync since 6.2.11
Describe the bug
Reference: Redis issue https://github.com/redis/redis/issues/12001. Since Redis PR https://github.com/redis/redis/pull/11785 is involved, once a single node holds more than 7GB of data, the replica node cannot sync with the primary node.
To reproduce
Reproduction steps from Redis issue https://github.com/redis/redis/issues/12001, specifically https://github.com/redis/redis/issues/12001#issuecomment-1743066121:
- create a cluster with 6 nodes (3 masters, 3 replicas)
- fill each node with 7GB or more of data
- restart a replica
- with high probability, the replica won't be able to sync, failing with an error like `== CRITICAL == This replica is sending an error to its master: 'Protocol error: too big inline request' after processing the command ''`; sometimes, though, it successfully finishes (~10% of cases)
Expected behavior
The replica node should be able to sync with the primary node.
Additional information
Is it starvation of the cluster bus?
Maybe something like dual-channel sync could help? RDB in a forked process, so the main process can still talk to the cluster bus...
The linked issue also has an open PR that attempts to fix this: https://github.com/redis/redis/pull/13308
Basically, the TCP keepalive idle time of the Redis inbound cluster connection was set to 2 * cluster node timeout, which people typically set to quite aggressive values (seconds). And the TCP keepalive probe interval is currently set to 1/3 of the idle time:
https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L1412
So the PR makes the TCP keepalive settings of the Redis inbound cluster connection configurable through the existing config variable server.tcpkeepalive (the same as other Redis server connections).
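If the description above is accurate, the change would amount to something like the following at the linked call site (a sketch of the idea, not the actual PR diff):

```c
/* Current behavior (as described): keepalive idle time tied to the
 * cluster node timeout, which is often configured aggressively in
 * seconds. */
connKeepAlive(conn, server.cluster_node_timeout / 1000 * 2);

/* Proposed behavior (as described): reuse the existing
 * server.tcpkeepalive config, like other server connections do. */
connKeepAlive(conn, server.tcpkeepalive);
```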
Any update on this? We are still facing this issue. Any idea when, and in which version, this will be fixed?
> The linked issue also has an open PR that attempts to fix this: redis/redis#13308
Happy to license that change under any license you need; if Valkey doesn't have such a fix yet, it really needs it.
Thanks @xnox! However, we don't even look at Redis changes because of the risk of legal trouble.
Without looking at the Redis PR: if it is as trivial as changing the TCP keepalive for cluster connections to a different value, we can of course do it.
If it involves more changes, the best option from the legal point of view is that you contribute a PR yourself. When you sign off the commit (using the -s option to git commit), you indicate that you have the right to contribute the change.
Ack, let me study the Valkey codebase to see if it complies with the relevant TCP RFCs w.r.t. keepalives.
Currently, we set it as `connKeepAlive(conn, server.cluster_node_timeout / 1000 * 2);`.
connKeepAlive() calls anetKeepAlive(). In anetKeepAlive(), we can see that we set the probing interval TCP_KEEPINTVL to 1/3 of the TCP_KEEPALIVE value, so if the cluster node timeout is 15000, TCP_KEEPALIVE is 30 seconds and the kernel will probe the connection every 10 seconds after that. We want to allow it to be unresponsive for the duration of server.cluster_node_timeout without triggering a failover, right?
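For context, here is a simplified sketch of the Linux branch of that logic, mirroring the behavior described above rather than quoting the anet.c source verbatim:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Simplified sketch of the keepalive setup described above:
 * idle time = interval, probe interval = interval / 3, 3 probes. */
static int keepAliveSketch(int fd, int interval) {
    int val = 1;
    /* Enable TCP keepalive on the socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) == -1)
        return -1;
#ifdef __linux__
    /* Seconds of idle time before the first probe is sent. */
    val = interval;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) < 0)
        return -1;
    /* Seconds between probes: one third of the idle time. */
    val = interval / 3;
    if (val == 0) val = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &val, sizeof(val)) < 0)
        return -1;
    /* Number of unanswered probes before the kernel drops the connection. */
    val = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) < 0)
        return -1;
#endif
    return 0;
}
```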
I'm not sure I understand this mechanism correctly. It would be better if the replication full sync did not block the cluster bus. It should take breaks to let the cluster communication work. I don't know why it blocks for such a long time. Is it synchronous disk I/O?
@hpatro FYI (You're always interested in cluster issues.)
As I understand it, on Linux, if the cluster node timeout is 15000, TCP_KEEPIDLE is set to 30 seconds. So after 30 seconds of idling/silence, the kernel will attempt to send 3 keepalive probes (TCP_KEEPCNT) at a 10-second (TCP_KEEPINTVL) interval, so the connection would be closed after 60 seconds.
We allow it to be unresponsive for 4 times the cluster node timeout, then?
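Spelled out as a quick back-of-the-envelope check (Linux keepalive semantics assumed):

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the discussion above. */
    int cluster_node_timeout_ms = 15000;
    int keepidle = cluster_node_timeout_ms / 1000 * 2; /* 30 s idle before the first probe */
    int keepintvl = keepidle / 3;                      /* 10 s between probes */
    int keepcnt = 3;                                   /* unanswered probes before the close */
    int close_after = keepidle + keepcnt * keepintvl;  /* 30 + 3 * 10 = 60 s */
    printf("connection closed after ~%d s (%.0fx node timeout)\n",
           close_after, close_after / (cluster_node_timeout_ms / 1000.0));
    return 0;
}
/* Prints: connection closed after ~60 s (4x node timeout) */
```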
TCP_KEEPALIVE seems to be a macOS-only sockopt, so the timeout might depend on which platform you've seen this issue on.
> @hpatro FYI (You're always interested in cluster issues.)
Yeah, thanks for the tag, this wasn't on my radar. Clustering stability matters a lot to us. I'm going to have to reproduce this; no one mentioned it to me internally, and there are plenty of full sync operations happening all the time.
High-level questions (I don't want us to look into the Redis repository):
- Any idea why 6.2.11 in particular?
- Has anyone else observed it?
@hpatro If you don't want to look into the Redis repository, I believe you can use Valkey's 6.2 branch. (We've deleted the old tags, but we kept the old release branches.)
We haven't fixed anything regarding this issue (if it existed), so I should be able to reproduce it with unstable, I believe. @zuiderkwast, in your environment(s) you have a much shorter timeout; do you observe this issue?
> In your environment(s) you have a much shorter timeout; do you observe this issue?
Not that I'm aware of. Maybe we don't have "one single node data more than 7GB".