
Backend Redis Node Freeze (e.g. Kernel Panic) Causes Predixy To Become Stuck

Open · williamchanrico opened this issue 3 years ago · 0 comments

Issue

When one of the Redis cluster nodes (either a master or a replica) fails at the host/OS level, Predixy gets stuck because its TCP connections to that node are never closed.

This issue is reproducible by triggering a kernel panic on one of the Redis cluster nodes (kernel.panic=0 keeps the node down instead of rebooting, and the SysRq trigger crashes the kernel immediately, so the node disappears without closing its TCP connections):

[email protected] $ sysctl -w kernel.panic="0"
[email protected] $ echo c > /proc/sysrq-trigger

We see the connection from Predixy to the failed Redis cluster node stuck in SYN-SENT in netstat:

SYN-SENT     0          1                                  10.1.1.177:59872                                          10.1.1.71:6379                            users:(("predixy",pid=11193,fd=8))                                             

At that point, our clients just hang waiting for Predixy. Even when both the clients and Predixy are restarted (reinitializing the connection pools), the problem persists. Setting various Redis client options such as IdleTimeout and MaxConnLifetime didn't help. We also changed the Predixy cluster config from KeepAlive "120" to "5", and it made no difference.

We observed that Predixy recovers on its own after about 8-10 minutes, returning multiple ERR no server connection avaliable errors during that window; most of the time, though, the clients and Predixy just sit there frozen.

Potential Workaround

We found that the TCP_USER_TIMEOUT socket option seems to help Predixy recover quickly.

To test that hypothesis, we added a new setsockopt call after src/Socket.cpp:180:

    val = 3;
    ret = setsockopt(mFd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val));
    if (ret != 0) {
        return false;
    }
    // Custom option that we're adding: abort the connection if transmitted
    // data stays unacknowledged for this long (TCP_USER_TIMEOUT, Linux >= 2.6.37)
    val = 1000; // milliseconds
    ret = setsockopt(mFd, IPPROTO_TCP, TCP_USER_TIMEOUT, &val, sizeof(val));
    if (ret != 0) {
        return false;
    }

It fixed the problem: instead of freezing, Predixy quickly returned ERR no server connection avaliable roughly every 1000 ms and then recovered completely. Is it a good idea to add this socket option?
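
For reference, here is a minimal standalone sketch (not Predixy code) of the keepalive plus TCP_USER_TIMEOUT combination the patch ends up with. It assumes Linux (the option is Linux-specific) and that <netinet/tcp.h> provides the constant; the address and values are purely illustrative:

    // Minimal sketch: a client socket configured like the patched Predixy code above.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        int val = 1;
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val));  // enable keepalive probes
        val = 3;
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val));  // give up after 3 failed probes

        // The proposed workaround: forcibly close the connection once transmitted
        // data has remained unacknowledged for this long (see tcp(7)).
        val = 1000; // milliseconds
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &val, sizeof(val));

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(6379);
        inet_pton(AF_INET, "10.1.1.71", &addr.sin_addr); // one of the Redis nodes above

        // Blocking connect; error handling kept minimal for the sketch.
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
            perror("connect");
        }
        close(fd);
        return 0;
    }

With the 1000 ms user timeout, unacknowledged data toward a frozen node makes the kernel abort the socket with ETIMEDOUT after about a second instead of retransmitting for many minutes, which matches the ERR no server connection avaliable responses appearing roughly every 1000 ms above.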

Setup

OS:

Linux test-predixy-proxy-10-1-1-177 5.3.0-1030-gcp #32~18.04.1-Ubuntu SMP Thu Jun 25 19:30:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Predixy Version: 1.0.5

Predixy Config:

Name "predixy-redis-proxy"
Bind "0.0.0.0:6379"
WorkerThreads "4"
MaxMemory "0"
ClientTimeout "0"
BufSize "4096"

ClusterServerPool {
    Password "xyz"
    MasterReadPriority "50"
    StaticSlaveReadPriority "60"
    DynamicSlaveReadPriority "60"
    RefreshInterval "1"
    ServerTimeout "1"
    ServerFailureLimit "10"
    ServerRetryTimeout "1"
    KeepAlive "5"
    Servers {
        + 10.1.1.71:6379
        + 10.1.1.22:6379
        + 10.1.1.20:6379
        + 10.1.1.16:6379
        + 10.1.1.40:6379
        + 10.1.1.81:6379
    }
}

Client Library in Golang: https://github.com/garyburd/redigo/tree/master/redis

williamchanrico · Jun 21 '21