node-redis icon indicating copy to clipboard operation
node-redis copied to clipboard

It will never reconnect when connection has closed

Open linkypi opened this issue 4 years ago • 6 comments

Hi, in the past few years, I find a critical bug in my production env. My production env use node-redis : 2.7.1. When the network disconnect , the node-redis will run the retry_strategy that I specified,and then the node-redis connection closed.But node-redis will not reconnect to the redis server again when the network is ok. So my production env always throw an error: The connection is already closed in the error log, always. Then I have to restart my node server to fix this. In my production env , the retry_strategy is:

retry_strategy: function (options) {
        // console.log('retry_strategy', options);
        if (options.error && options.error.code === 'ECONNREFUSED') {
            // End reconnecting on a specific error and flush all commands with
            // a individual error
            return new Error('The server refused the connection');
        }
        if (options.total_retry_time > 1000 * 60 * 60) {
            // End reconnecting after a specific timeout and flush all commands
            // with a individual error
            return new Error('Retry time exhausted');
        }
        if (options.attempt > 10) {
            // End reconnecting with built in error
            return new Error('Attempt time exhausted');
        }
        // reconnect after
        return Math.min(options.attempt * 100, 3000);
    }

I doubt a very long time, why this would happended. So I dig into the source codes , the node-redis version is 2.8.0. It has this bug too. Then I find the reason:

function handle_offline_command (self, command_obj) {
    var command = command_obj.command;
    var err, msg;
    if (self.closing || !self.enable_offline_queue) {
        command = command.toUpperCase();
        if (!self.closing) {
            if (self.stream.writable) {
                msg = 'The connection is not yet established and the offline queue is deactivated.';
            } else {
                msg = 'Stream not writeable.';
            }
        } else {
            msg = 'The connection is already closed.';
        }
        err = new errorClasses.AbortError({
            message: command + " can't be processed. " + msg,
            code: 'NR_CLOSED',
            command: command
        });
        if (command_obj.args.length) {
            err.args = command_obj.args;
        }
        utils.reply_in_order(self, command_obj.callback, err);
    } else {
        debug('Queueing ' + command + ' for next server connection.');
        self.offline_queue.push(command_obj);
    }
    self.should_buffer = true;
}

image image When the connection is down, then the ready is false and stream is not writable. And handle_offline_command will handle this .At this time the closing is always true, so it will throw an AbortError. So I donot know why retry connection to de redis server in this scope.

I fix this code like this, and it works:

function handle_offline_command (self, command_obj) {
    var command = command_obj.command;
    var err, msg;
    if (self.closing || !self.enable_offline_queue) {
        if(self.stream == null || self.stream.destroyed){
          retry_connection(self, null);
        }else{
          self.offline_queue.push(command_obj);
        }
        // command = command.toUpperCase();
        // if (!self.closing) {
        //     if (self.stream.writable) {
        //         msg = 'The connection is not yet established and the offline queue is deactivated.';
        //     } else {
        //         msg = 'Stream not writeable.';
        //     }
        // } else {
        //     msg = 'The connection is already closed.';
        // }
        // err = new errorClasses.AbortError({
        //     message: command + " can't be processed. " + msg,
        //     code: 'NR_CLOSED',
        //     command: command
        // });
        // if (command_obj.args.length) {
        //     err.args = command_obj.args;
        // }
        // utils.reply_in_order(self, command_obj.callback, err);
    } else {
        debug('Queueing ' + command + ' for next server connection.');
        self.offline_queue.push(command_obj);
    }
    self.should_buffer = true;
}

I donot know it will has any other logic code relatively, although I fix it and it work well . So I hope you give me a reliable solution, thks.The code is a little bit messy, so I very want to rewrite your node-redis using asyn await. But I have no time in this year.

linkypi avatar Jan 16 '20 07:01 linkypi

I've run into exactly this today with the same standard retry_strategy as shown above. In trying to understand what was going on, it struck me that this code doesn't seem to make a lot of sense:

if (options.error && options.error.code === 'ECONNREFUSED') {
    // End reconnecting on a specific error and flush all commands with an individual error
    return new Error('The server refused the connection');
}

Isn't that pretty much exactly the circumstance when you need the retries to continue (e.g. temporary server disconnect)?

Commenting the above lines out at least seems to let my retry strategy keep retrying. Are there other negative impacts of making this change?

stephent avatar Jan 26 '20 05:01 stephent

I'm trying to understand this also... As per the docs:

If you return a number from this function, the retry will happen exactly after that time in milliseconds. If you return a non-number, no further retry will happen and all offline commands are flushed with errors.

Which means that a retry_strategy like this:

if (options.error && options.error.code === 'ECONNREFUSED') {
    return new Error('The server refused the connection');
} 

will make the client never retry again. It must return a number value.

Now, what even gets worse is that in the example

var client = redis.createClient({
    retry_strategy: function (options) {
        if (options.error && options.error.code === 'ECONNREFUSED') {
            return new Error('The server refused the connection');
        }
        if (options.total_retry_time > 1000 * 60 * 60) {
            return new Error('Retry time exhausted');
        }
        if (options.attempt > 10) {
            return undefined;
        }
        return Math.min(options.attempt * 100, 3000);
    }
});

the 2nd and 3rd if statements will never be executed in case of failure of type ECONNREFUSED and re-connection won't stop after 1000 * 60 * 60ms or after 10 attempts!

After changing the order:

retry_strategy = (options) => {
    if (options.total_retry_time > 1000 * 60 * 60) {
      return new Error('Retry time exhausted');
    }
    if (options.attempt > 5) {
      return new Error('Max attempts reached');;
    }
    if (options.error && options.error.code === 'ECONNREFUSED') {
      return 500;
    }
    return Math.min(options.attempt * 100, 3000);
  }

It should work as intended. I guess.

Returning a new Error() won't emit any event log... Re-connection will stop without logging any error message.

eladbs avatar Jan 29 '20 16:01 eladbs

@eladbs nice investigation there, would you be willing to send over a PR to correct the example in the docs? Thank you

Salakar avatar Feb 09 '20 06:02 Salakar

I'm interested in what will happen to this moving forward in v3+, as connect_timeout is deprecated. To get the library to keep retrying forever, I needed to combine a simple retry strategy with the connect_timeout option, e.g

const retryStrategy = (options) => {
  // retry forever, with a max retry delay of 5s
  return Math.min(options.attempt * 100, 5000);
};

const client = redis.createClient(connectionString, {
  enable_offline_queue: false,
  retry_strategy: retryStrategy,
  connect_timeout: Number.MAX_SAFE_INTEGER, // (the default is 1 hour)
});

Otherwise, this still happens after one hour, regardless of the retry stategy.

https://github.com/NodeRedis/node-redis/blob/da31ade348f9b98ebc5a3164d59a93576a865906/index.js#L579

deancouch avatar Feb 10 '20 04:02 deancouch

Hi

As mentioned by @deancouch, if you look in the code, connect_timeout (it is not deprecated as far as I can see) is set to 1hr by default:

  this.connect_timeout = +options.connect_timeout || 3600000; // 60 * 60 * 1000 ms

Therefore, regardless of what workaround you use in your code tweak or retry strategy above, the library will throw an error and end the connection after this timeout. Unless you specific an even bigger number (like @deancouch's example above) to prevent that from happening, you could handle the condition.

For more thorough error handling, you can catch, recreate and throw on redis errors, as per snippet below.

public async checkRecreateRedisOnError(err: any) {
    const isRedisError = (err instanceof RedisError) ? true : false;
    const isAbortError = (err.constructor.name === "AbortError") ? true : false;
    const isAggregateError = (err.constructor.name === "AggregateError") ? true : false;
    const errorCode = err.code;
    if ((isRedisError || isAbortError || isAggregateError) && errorCode) {
        await this.checkRecreateRedis();
        // we rethrow the error after reconnecting, so upper level can perform a retry
        throw err;
    }
}

checkRecreateRedis() is shown in the code snippet below, which can also be called when not in an error condition, eg you have a watchdog loop which checks the health of the redis connection:

public async checkRecreateRedis() {
    if (this.redisClient.connected) {
        return;
    }
    while (!this.redisClient.connected) {
        if (this.redisClient.closing) {
            try {
                this.redisClient = await new RedisClient(this.options);
            } catch (err) {
                console.log(`Failed to recreate redis client: [${err.message}]`);
            }
        } else {
            await this.sleep(1000);
        }
    }
}

Please note that the above methods have been simplified to enhance readability. It also assumes you want to block on error in your code, so if your app cannot be blocked (eg you only want to try a redis read/write and log on error), please do not be use them. In this non-blocking case, you may just want to have a watchdog loop which calls checkRecreateRedis() to ensure the redis client is recreated after it has ended.

keith-chew avatar Feb 17 '20 11:02 keith-chew

        if (options.attempt > 10) {
            // End reconnecting with built in error
            return new Error('Attempt time exhausted');
        }

You can change the

return new Error('Attempt time exhausted');

to

return 1000;

It means retry_delay, then the client will retry without stopping

toearth avatar Sep 24 '20 07:09 toearth