node-redis
createCluster clients don't handle on('error') correctly
Description
We use cluster-mode with redis for sharded pub-sub (we have 3 masters and 3 replicas in a kubernetes cluster).
We have the following args for the clients:
```ts
const clusterArgs = {
  rootNodes: [
    {
      url: `redis://${REDIS_CLUSTER_PUBSUB_HOST}:${redisPort}`,
    },
  ],
  defaults: {
    username: REDIS_CLUSTER_PUBSUB_NAME,
    password: REDIS_CLUSTER_PUBSUB_PASS,
    socket: {
      reconnectStrategy(retries: number) {
        if (retries >= 10) {
          console.error(
            `lost connection to redis cluster-pubsub cluster: tried ${retries} times`
          );
        } else {
          console.warn(
            `retrying redis cluster-pubsub cluster connection: tried ${retries} times`
          );
        }
        // reconnect after a backoff capped at 2 seconds
        return Math.min(retries * 200, 2000);
      },
      connectTimeout: 10000,
      keepAlive: 60000,
    },
  },
};
```
and then we create the client(s) like this:
```ts
const client = createCluster(clusterArgs);
await client.connect();
client.on('error', (err) => {
  console.error(`[PUB-SUB ERROR]: ${err}`);
});
```
Sometimes our redis pub-sub cluster goes down (e.g. for maintenance or when we upgrade to a new version, since we run it in kubernetes), and we'll receive the following error:
Error: Socket closed unexpectedly
We correctly log the error by catching it in the error handler, but we never seem to retry or reconnect; the only way I can get a reconnect to actually happen is to repeatedly restart the process until the reconnection succeeds.
Also, if the process tries to issue a command, we sometimes get an internal error that kills the process with an uncaught exception, even though I've added the client.on('error') handler above.
I followed the findings from #2120 and #2302, but those don't really seem to solve our problems.
What I'd like is to be able to specify a reconnect strategy so that we continually retry (according to the reconnectStrategy) if we lose our TLS connection or fail to talk to a node in the cluster. I'd also like messages to be queued while we're offline, instead of an error being thrown that takes down the process.
Node.js Version
20.11.1
Redis Server Version
7.0.10
Node Redis Version
4.6.13
Platform
linux
Logs
No response
I'm also observing this same behavior.
I am also seeing this same behavior. It seems to happen for us whenever we perform something that will require a rolling update to redis (kubernetes) -- for example a version upgrade or a config update.
The application will end up spewing errors, and never reconnect, even when the cluster is healthy for a long time.
The only solution I could come up with (like OP) is to restart the process; in my case, by failing the health check so Kubernetes restarts the pod.
This is certainly a non-ideal solution.
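The health-check workaround described above can be sketched roughly as follows. All names here are hypothetical, not from any of the affected codebases: once the client has been failing longer than a threshold, the liveness endpoint starts reporting unhealthy so Kubernetes restarts the pod.

```javascript
// Sketch of the "fail the health check" workaround: track how long the
// Redis client has been erroring and report unhealthy past a threshold.
const FAILURE_THRESHOLD_MS = 30_000;
let lastErrorAt = null;

// Wire these to the client's 'error' and 'ready' events.
function onRedisError() {
  if (lastErrorAt === null) lastErrorAt = Date.now();
}

function onRedisReady() {
  lastErrorAt = null; // connection recovered; reset the failure window
}

// Return the HTTP status the liveness probe endpoint should respond with.
function healthStatus(now = Date.now()) {
  const unhealthy =
    lastErrorAt !== null && now - lastErrorAt > FAILURE_THRESHOLD_MS;
  return unhealthy ? 500 : 200;
}
```

The threshold prevents a single transient error from killing the pod; only a sustained failure window triggers a restart.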
The general steps to repro:
- Run an app in Kubernetes that uses Redis in cluster mode (I had a 3-master/3-replica configuration)
- Perform a rollout restart of Redis (this incrementally restarts each Redis pod)
- Notice that the client never reconnects properly
I can also confirm this is an issue. Our setup is almost identical to the original post.
After any kind of rolling update to a Kubernetes Redis cluster, once the pods are restarted we get Socket closed unexpectedly on the clients. That by itself is fine, but as explained above, the retry strategy never kicks in and the cluster client ends up in a closed state.
While restarting the affected service does reconnect properly, we have a temporary workaround in place that should hopefully result in less downtime. A simplified example is pasted below.
```js
let reconnecting;

client.on('error', (err) => {
  console.error('Redis cluster error:', err);
  if (!reconnecting) {
    reconnecting = setTimeout(async () => {
      reconnecting = undefined;
      try {
        await reconnect();
      } catch (err) {
        // ignore; the next error will schedule another attempt
      }
    }, 5000);
  }
});
```
This is a rather serious issue that should get more attention, as at the moment there is no real alternative to this library for connecting to a Redis cluster in a Node.js environment (our experience with ioredis and Redis cluster on Kubernetes was even worse).
I'm also observing this same behavior.
I found that versions 4.6.14 and 4.6.13 (the ones I tested) are bugged: with a Redis cluster going down, the reconnectStrategy would not trigger. At the moment the only resolution was to revert to a working version, like 4.5.1. I think some regression has been introduced, but I don't know much more.
I am seeing the same behaviour in my cluster with 4.6.14.
Downgrading to 4.5.1 is sadly not an option:
- it does not seem to support hostname endpoints via --cluster-preferred-endpoint-type hostname
- it will try to reconnect to outdated node IPs (which changed after the k8s rolling update) instead of using the announced hostnames
Testing with ioredis shows no problems reconnecting to a restarting cluster, as it periodically rediscovers the slots/nodes every 5s. Maybe that could be part of the solution here too?
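The periodic-rediscovery behavior attributed to ioredis above can be sketched as a simple polling loop. This is an illustration only, not either library's implementation: `fetchSlots` is a hypothetical stand-in for a CLUSTER SLOTS / CLUSTER SHARDS query, and `onTopology` for applying the result.

```javascript
// Sketch: re-fetch the cluster's slot map on a fixed interval so that
// node address changes after a rolling restart are eventually picked up.
function startTopologyRefresh(fetchSlots, onTopology, intervalMs = 5000) {
  const timer = setInterval(async () => {
    try {
      onTopology(await fetchSlots()); // apply freshly discovered layout
    } catch {
      // ignore; the next tick will retry
    }
  }, intervalMs);
  timer.unref?.(); // don't keep the process alive just for polling
  return () => clearInterval(timer); // caller stops polling with this
}
```

Polling trades a small amount of steady-state traffic for bounded staleness: even if every cached node address goes bad at once, the client recovers within one interval.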
@leibale can you or one of the active maintainers take a look here?
We use node-redis as a dependency in our library, and multiple projects that use our library have reported this problem. When a connection error occurs with the cluster, we need the reconnect strategy to kick in; instead, the client closes and all subsequent communication is blocked until a server restart occurs.
We would like to avoid building a workaround here. We use the library's client with near-identical code for clusters and single instances, and for single instances the reconnect works as expected, so any workaround would need some novel code separation. A workaround is also likely to clash with future fixes in node-redis, and the related dependency version limitations would cause operational overhead and confusion among our users...
Same issue here - this PR fixes it: https://github.com/redis/node-redis/pull/2783
I can confirm https://github.com/redis/node-redis/pull/2783 fixes the issue for us.
@redis/[email protected] / [email protected] is on npm 🎉
@soccermax thanks! :)
Thanks for the quick release. Very much appreciated 👍