
createCluster clients don't handle on('error') correctly

Open kseth opened this issue 11 months ago • 6 comments

Description

We use cluster-mode with redis for sharded pub-sub (we have 3 masters and 3 replicas in a kubernetes cluster).

We have the following args for the clients:

    const clusterArgs = {
      rootNodes: [
        {
          url: `redis://${REDIS_CLUSTER_PUBSUB_HOST}:${redisPort}`,
        },
      ],
      defaults: {
        username: REDIS_CLUSTER_PUBSUB_NAME,
        password: REDIS_CLUSTER_PUBSUB_PASS,
        socket: {
          reconnectStrategy(retries: number) {
            if (retries >= 10) {
              console.error(
                `lost connection to redis cluster-pubsub cluster: tried ${retries} times`
              );
            } else {
              console.warn(
                `retrying redis cluster-pubsub cluster connection: tried ${retries} times`
              );
            }

            // reconnect after
            return Math.min(retries * 200, 2000);
          },
          connectTimeout: 10000,
          keepAlive: 60000,
        },
      },
    };
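For reference, the `reconnectStrategy` above yields a linear backoff capped at 2 seconds. A quick sketch of the delays it returns (the helper below is ours, written only to illustrate the config; it is not part of node-redis):

```typescript
// Same delay computation as the reconnectStrategy in the config above:
// linear backoff of 200 ms per attempt, capped at 2000 ms.
function reconnectDelay(retries: number): number {
  return Math.min(retries * 200, 2000);
}

// Delays for the first attempts: 0, 200, 400, ... then a flat 2000 ms.
const delays = Array.from({ length: 12 }, (_, i) => reconnectDelay(i));
console.log(delays.join(', '));
```

Note that returning a number keeps the client retrying forever; the `retries >= 10` branch above only changes the log level, it does not stop reconnection attempts.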

and then we create the client(s) like this:

    const client = createCluster(clusterArgs);
    // attach the error handler before connecting, so connection-time
    // errors don't surface as uncaught exceptions
    client.on('error', (err) => {
      console.error(`[PUB-SUB ERROR]: ${err}`);
    });
    await client.connect();

Sometimes our Redis pub-sub cluster goes down (e.g. for maintenance or when we upgrade to a new version, since we run it in Kubernetes), and we'll receive the following error:

Error: Socket closed unexpectedly

We log the error by catching it in the error handler, but the client never seems to retry or reconnect -- the only way I can get a reconnect to actually happen is to repeatedly restart the process until the connection succeeds.

Also, if the process tries to issue a command, we sometimes get an internal error that kills the process with an uncaught Node.js exception, even though I've added the `client.on('error')` handler above.

I followed the findings from #2120 and #2302, but those don't really seem to solve our problems.

What I'd like is to be able to specify a reconnect strategy so that we keep retrying (according to the `reconnectStrategy`) if we lose our TLS connection or fail to talk to a node in the cluster. I'd also like commands to be queued while we're offline, instead of an error being thrown and taking down the process.
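Until the library offers this, offline queueing can be approximated at the application level. A minimal sketch under stated assumptions -- the `OfflineQueue` class and `Command` type below are hypothetical helpers of ours, not node-redis APIs:

```typescript
// Minimal application-level offline buffer: commands issued while the
// connection is down are stored and replayed once we are back online.
type Command = () => Promise<unknown>;

class OfflineQueue {
  private pending: Command[] = [];
  private online = true;

  setOnline(online: boolean): void {
    this.online = online;
    if (online) void this.flush();
  }

  // Run the command now if online, otherwise park it for later replay.
  run(cmd: Command): void {
    if (this.online) {
      void cmd().catch((err) => console.error('command failed:', err));
    } else {
      this.pending.push(cmd);
    }
  }

  private async flush(): Promise<void> {
    while (this.online && this.pending.length > 0) {
      const cmd = this.pending.shift()!;
      await cmd().catch((err) => console.error('replay failed:', err));
    }
  }
}
```

Wired up, the error handler would call `queue.setOnline(false)` and a successful reconnect would call `queue.setOnline(true)`; commands go through `queue.run(() => client.publish(...))` instead of being issued directly.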

Node.js Version

20.11.1

Redis Server Version

7.0.10

Node Redis Version

4.6.13

Platform

linux

Logs

No response

kseth avatar Mar 19 '24 23:03 kseth

I'm also observing this same behavior.

cmotl3y avatar May 11 '24 18:05 cmotl3y

I am also seeing this same behavior. It seems to happen for us whenever we perform something that will require a rolling update to redis (kubernetes) -- for example a version upgrade or a config update.

The application will end up spewing errors, and never reconnect, even when the cluster is healthy for a long time.

The only solution I could come up with is (like OP) to restart the process (in my case, by failing the health check so Kubernetes restarts the pod).

This is certainly a non-ideal solution.

The general steps to repro:

  1. Run an app in Kubernetes that uses Redis in cluster mode (I had a 3-master/3-replica configuration)
  2. Perform a rollout restart of Redis (this incrementally restarts each node)
  3. Observe that the client never reconnects properly

jjsimps avatar May 21 '24 00:05 jjsimps

I can also confirm this to be an issue. Our setup is almost identical to original post.

After any kind of rolling update to the Kubernetes Redis cluster, once the pods are restarted the clients get `Socket closed unexpectedly` errors. That part is expected, but as explained above, the retry strategy never kicks in and the cluster client ends up in a closed state.

While restarting the affected service does reconnect properly, we have a temporary workaround in place that should hopefully result in less downtime. I have pasted a simplified example below.

    let reconnecting;

    client.on('error', err => {
      console.error('Redis cluster error:', err);

      // debounce: schedule at most one reconnect attempt at a time
      if (!reconnecting) {
        reconnecting = setTimeout(async () => {
          reconnecting = undefined;
          try {
            // our own helper that tears down and recreates the cluster client
            await reconnect();
          } catch (err) { }
        }, 5000);
      }
    });

This is a rather serious issue that deserves more attention: at the moment there is no real alternative to this library for connecting to a Redis cluster from Node.js (our experience with ioredis against a Redis cluster on Kubernetes was even worse).

DusanBaric avatar May 30 '24 12:05 DusanBaric

I'm also observing this same behavior.

drewbitz avatar Jun 03 '24 19:06 drewbitz

I found that versions 4.6.13 and 4.6.14 (the ones I tested) are bugged: when a Redis cluster goes down, the `reconnectStrategy` does not trigger. The only resolution for me was to revert to a working version, such as 4.5.1. I suspect a regression was introduced, but I haven't pinpointed it.

arcogabbo avatar Jun 20 '24 08:06 arcogabbo

I am seeing the same behaviour in my cluster with 4.6.14.

Downgrading to 4.5.1 is sadly not an option: it does not seem to support hostname endpoints via `--cluster-preferred-endpoint-type hostname` and will try to reconnect to outdated node IPs (which changed after the k8s rolling update) instead of using the announced hostnames.

Testing with ioredis shows no problems reconnecting to a restarting cluster, since it periodically rediscovers the slots/nodes every 5s. Maybe that could be part of the solution here too?

carlhopf avatar Jun 25 '24 14:06 carlhopf

@leibale can you or one of the active maintainers take a look here?

We use node-redis as a dependency in our library, and multiple projects that use our library have reported this problem. When a connection error occurs with the cluster, we need the reconnect strategy to kick in; instead, the client closes and all subsequent communication is blocked until the server restarts.

We would like to avoid building a workaround here. We use the library's client with near identical code for clusters and single instances and for single instances the reconnect works as expected. So any workaround will need some novel code separation. A workaround is also likely to clash with any future fixes in node-redis and the related dependency version limitations will cause operational overhead and confusion in the user base...

rlindner81 avatar Jul 02 '24 06:07 rlindner81

Same issue here -- this PR fixes it: https://github.com/redis/node-redis/pull/2783

soccermax avatar Jul 02 '24 12:07 soccermax

I can confirm https://github.com/redis/node-redis/pull/2783 fixes the issue for us.

rlindner81 avatar Jul 02 '24 13:07 rlindner81

@redis/[email protected]/[email protected] is on npm 🎉 @soccermax thanks! :)

leibale avatar Jul 02 '24 17:07 leibale

Thanks for the quick release. Very much appreciated 👍

soccermax avatar Jul 02 '24 17:07 soccermax