Backend Redis Node Freeze (e.g. Kernel Panic) Causes Predixy To Become Stuck
Issue
When one of the Redis cluster nodes (either a master or a replica) fails at the host/OS level, Predixy gets stuck because its TCP connections to the failed node are never closed.
This issue is reproducible by triggering a kernel panic on one of the Redis cluster nodes (as root; setting kernel.panic to 0 disables the automatic reboot so the node stays frozen):
$ sysctl -w kernel.panic="0"
$ echo c > /proc/sysrq-trigger
We see the connection from Predixy to the failed Redis cluster node stuck in SYN-SENT in the socket list (ss output):
SYN-SENT 0 1 10.1.1.177:59872 10.1.1.71:6379 users:(("predixy",pid=11193,fd=8))
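For context on why the connection sits in SYN-SENT: a kernel-panicked host sends back neither a SYN-ACK nor an RST, so a plain blocking connect() only fails once the kernel exhausts its SYN retransmissions (net.ipv4.tcp_syn_retries, roughly two minutes with the default of 6). A minimal standalone C++ sketch (not Predixy code; the frozen node's address is taken from our setup) demonstrates the hang:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <ctime>

int main() {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(6379);
    inet_pton(AF_INET, "10.1.1.71", &addr.sin_addr); // the frozen node

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    time_t start = time(nullptr);
    // No timeout configured: this blocks until the kernel gives up
    // retransmitting SYNs (net.ipv4.tcp_syn_retries, ~127s by default)
    int ret = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    printf("connect returned %d (%s) after %ld seconds\n",
           ret, strerror(errno), (long)(time(nullptr) - start));
    close(fd);
    return 0;
}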
And at that point, our clients just hang waiting for Predixy.
Even when clients AND Predixy are restarted (reinitializing the connection pool), the problem persists.
Setting various Redis client configuration options such as IdleTimeout and MaxConnLifetime didn't help.
We also changed KeepAlive in the Predixy cluster config from "120" to "5", and it made no difference. In hindsight that makes sense: TCP keepalive probes are only sent on otherwise idle connections, so they never abort a connection that is stuck retransmitting an unacknowledged SYN or unacknowledged data.
We observed that Predixy will recover on its own after about 8-10 minutes, returning multiple ERR no server connection avaliable error messages during that window (but for most of those 8-10 minutes the clients and Predixy are just sitting there frozen), presumably once the kernel's TCP retransmission limits finally give up on the stuck connections.
Potential Workaround
We found that the TCP_USER_TIMEOUT socket option seemed to help Predixy recover quickly.
To test that hypothesis, we added a new setsockopt call after src/Socket.cpp:180:
// Existing keepalive option in Socket.cpp
val = 3;
ret = setsockopt(mFd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val));
if (ret != 0) {
    return false;
}
// Custom option that we're adding: abort the connection if transmitted
// data remains unacknowledged for 1000 ms (value is in milliseconds)
val = 1000;
ret = setsockopt(mFd, IPPROTO_TCP, TCP_USER_TIMEOUT, &val, sizeof(val));
if (ret != 0) {
    return false;
}
And it fixed the problem: Predixy quickly returned multiple ERR no server connection avaliable errors, roughly one every 1000 ms, before recovering completely.
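For anyone who wants to try this outside of Predixy, here is a self-contained sketch of the resulting option set (the helper name setServerSockOpts and the keepalive values are our assumptions, not Predixy defaults). Per tcp(7), TCP_USER_TIMEOUT is the maximum time, in milliseconds, that transmitted data may remain unacknowledged before the connection is forcibly closed, which is exactly the "peer silently vanished" situation a kernel panic produces:

#include <netinet/in.h>
#include <netinet/tcp.h>  // TCP_KEEPIDLE, TCP_KEEPCNT, TCP_USER_TIMEOUT
#include <sys/socket.h>

// Hypothetical helper: applies keepalive plus user timeout to a backend fd
bool setServerSockOpts(int fd) {
    int val = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) != 0)
        return false;
    val = 5; // start probing after 5s idle, matching KeepAlive "5"
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) != 0)
        return false;
    val = 3; // give up after 3 unanswered probes
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) != 0)
        return false;
    // Abort the connection if transmitted data stays unacknowledged
    // for 1000 milliseconds
    unsigned int timeoutMs = 1000;
    if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &timeoutMs, sizeof(timeoutMs)) != 0)
        return false;
    return true;
}

One caveat for the question below: tcp(7) notes that a non-zero TCP_USER_TIMEOUT also overrides keepalive when deciding when to close a connection, and 1000 ms is fairly aggressive on a lossy network, so it would probably be better exposed as a config option than hardcoded.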
Is it a good idea to add this socket option?
Setup
OS:
Linux test-predixy-proxy-10-1-1-177 5.3.0-1030-gcp #32~18.04.1-Ubuntu SMP Thu Jun 25 19:30:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Predixy Version: 1.0.5
Predixy Config:
Name "predixy-redis-proxy"
Bind "0.0.0.0:6379"
WorkerThreads "4"
MaxMemory "0"
ClientTimeout "0"
BufSize "4096"
ClusterServerPool {
    Password "xyz"
    MasterReadPriority "50"
    StaticSlaveReadPriority "60"
    DynamicSlaveReadPriority "60"
    RefreshInterval "1"
    ServerTimeout "1"
    ServerFailureLimit "10"
    ServerRetryTimeout "1"
    KeepAlive "5"
    Servers {
        + 10.1.1.71:6379
        + 10.1.1.22:6379
        + 10.1.1.20:6379
        + 10.1.1.16:6379
        + 10.1.1.40:6379
        + 10.1.1.81:6379
    }
}
Client Library in Golang: https://github.com/garyburd/redigo/tree/master/redis