go-redis
Retries may be done using broken connections from the pool
Expected Behavior
When restarting the Redis server, due to the default of MaxRetries: 3, I'd expect go-redis to attempt to reconnect to the server and retry the query.
Current Behavior
If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.
Possible Solution
Some options:
- Perform a ping on the connection before reusing it (adds latency if the connection is fine)
- Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
Steps to Reproduce
The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.
package main

import (
    "context"
    "github.com/go-redis/redis/v8"
    "log"
    "net"
    "time"
)

func main() {
    ctx := context.Background()

    // Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
    client := redis.NewClient(&redis.Options{
        Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
            log.Print("Dialer called")
            conn, err := net.Dial(network, addr)
            if err == nil {
                go func() {
                    time.Sleep(5 * time.Second)
                    conn.Close()
                }()
            }
            return conn, err
        },
    })

    // Function to ping the server and log errors if they occur.
    ping := func() {
        _, err := client.Ping(ctx).Result()
        if err != nil {
            log.Print(err)
        }
    }

    // Perform some pings to fill the connection pool.
    for i := 0; i < 10; i++ {
        go ping()
    }

    // Wait for connections to die.
    time.Sleep(10 * time.Second)

    // Perform another ping and log pool stats before and after.
    log.Printf("%#v", client.PoolStats())
    ping()
    log.Printf("%#v", client.PoolStats())
}
Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}
You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.
When adding MaxRetries: 20 to the redis.Options, the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):
2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}
Context (Environment)
I want my application to gracefully handle Redis restarts and had hoped that this was handled automatically by go-redis due to the default of MaxRetries: 3.
- Perform a ping on the connection before reusing it (adds latency if the connection is fine)
- Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
All these options look worse than the current behavior. Even with large pools like 100 connections we are talking only about 100/3 = 33 failed commands.
Could you please explain why exactly you don't prefer them, especially the first one?
Why not check whether a connection is sane before using it "in production"?
Because that means executing a ping command for each command we want to process. That means 2x latency and 2x the number of commands.
And the last suggested solution?
Always use a fresh connection for retries
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
That is the most reasonable option, if you can provide arguments why we should add some special logic to handle retries. As I said, we are talking about 100/3 = 33 failed commands. In practice even that is an exaggeration, because there is a small delay before the command is retried, so it is unlikely that the command will be retried using bad connections.
Maybe I don't get the purpose of the retry functionality in go-redis. I'd expect it to handle situations like a Redis restart where all existing connections fail simultaneously, so that I don't have to handle this in my application code. The current implementation leads to errors returned from go-redis in situations where I would not expect them.
In practice even that is an exaggeration, because there is a small delay before the command is retried, so it is unlikely that the command will be retried using bad connections.
That's only the case if you perform lots of concurrent actions on the pool. If there are situations where only a few goroutines interact with Redis, it is quite likely that you will receive an error from go-redis even though an attempt on a fresh connection would have worked perfectly fine. Reducing the pool size is not an option if the amount of concurrent Redis queries changes over time.
Well... unlikely != impossible. So I'd like to reduce 33 or even 3 to 0.
if you can provide arguments why we should add some special logic to handle retries
Pro: on a Redis restart, commands are not just unlikely but guaranteed not to be retried on bad connections, which increases app stability. Apps also don't have to set max retries to pool size + x, and recovery from a Redis restart doesn't take much longer.
Contra: you possibly re-dial unnecessarily, but if one connection breaks, the others likely do as well (a Redis restart or similar). This should not hurt much because:
- Redis should be operational most of the time
- if you're going to retry, you've already accepted extra latency
Shall I make a PR?
This is what database/sql does - https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816. You will have to
- introduce the equivalent of cachedOrNewConn/alwaysNewConn when retrieving a conn
- introduce ErrBadConn
- refactor the retry loop (in multiple places)
It is going to be non-trivial, but feel free to give it a try :)
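For illustration, here is a minimal, hypothetical sketch of that database/sql-style approach. None of these names (ErrBadConn, connStrategy, cachedOrNewConn, alwaysNewConn, process) exist in go-redis; they only mirror the database/sql pattern linked above: retry on pooled connections a bounded number of times, then make one final attempt on a freshly dialed connection.

// Hypothetical sketch only: these names are not go-redis API, they mirror the
// database/sql pattern (connReuseStrategy + driver.ErrBadConn) linked above.
package pool

import (
    "context"
    "errors"
)

// ErrBadConn would signal that a connection is unusable and the command may be
// retried on a different connection, like database/sql's driver.ErrBadConn.
var ErrBadConn = errors.New("redis: bad connection")

type connStrategy int

const (
    cachedOrNewConn connStrategy = iota // reuse an idle pooled connection if one exists
    alwaysNewConn                       // skip the pool and dial a fresh connection
)

// process stands in for the command retry loop: try pooled connections up to
// maxRetries times, then make one last attempt on a guaranteed-fresh connection.
func process(ctx context.Context, maxRetries int, exec func(ctx context.Context, strategy connStrategy) error) error {
    var err error
    for attempt := 0; attempt <= maxRetries; attempt++ {
        err = exec(ctx, cachedOrNewConn)
        if !errors.Is(err, ErrBadConn) {
            return err // success, or an error that is not connection-related
        }
    }
    return exec(ctx, alwaysNewConn)
}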
And the much simpler suggestion (https://github.com/go-redis/redis/issues/1737#issuecomment-825703168) is not an option for you?
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
database/sql does a similar thing but only as an extra last retry. The implementation cost will not be any different and it will match the database/sql behavior which is a nice thing.
It is going to be non-trivial
Problems aren’t stop signs, but signposts (for me). :)
Hm... the command-processing flow is kinda complex (cmdable -> process -> _process -> withConn -> With* -> ...).
Where would you like to see the check whether the just-returned error is X/Y/Z, along with the returning of ErrBadConn?
I'd rather not just shoot in the dark by myself, as you may say afterwards: err... no.
Or: do we even need that and ErrBadConn itself? Can't we just say retryable error -> bad conn?
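As a rough illustration of the "retryable error -> bad conn" idea, a classifier like the sketch below could decide when to discard a pooled connection. isBadConnError is a made-up name, not go-redis API, and the string check simply matches the error seen in the repro output above.

// Hypothetical sketch: a classifier for the "retryable error -> bad conn" idea.
// isBadConnError is not go-redis API; the names are illustrative only.
package pool

import (
    "errors"
    "io"
    "net"
    "strings"
)

// isBadConnError reports whether err suggests the underlying connection is
// unusable (as opposed to a normal Redis error reply), so the caller should
// discard the connection and retry on a different one.
func isBadConnError(err error) bool {
    if err == nil {
        return false
    }
    if errors.Is(err, io.EOF) {
        return true // server closed the connection, e.g. during a Redis restart
    }
    var netErr net.Error
    if errors.As(err, &netErr) {
        return true // failure at the network layer (dial, read, write, timeout)
    }
    // Matches the "use of closed network connection" error from the repro output.
    return strings.Contains(err.Error(), "use of closed network connection")
}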
FWIW, I can very reliably reproduce this issue on AWS ElastiCache using a small Redis instance with a primary and a replica in non-clustered mode and the default parameter group. Importantly, this sets close-on-replica-write to true. I then have multiple goroutines reading and writing from the Redis server and initiate a failover. Invariably this leads to read-only commands like HKEYS failing with an EOF error, presumably because they are reusing connections that were closed due to the close-on-replica-write setting but left in the pool.
What about adding a user-supplied OnReuseConnection hook, like OnConnect, that is called before a connection is reused from the pool?
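To make the proposal concrete, here is a hedged sketch of what such a hook could look like. OnConnect exists in go-redis v8; OnReuseConnection is purely the commenter's proposed (non-existent) option and is shown commented out.

// Sketch only: OnReuseConnection does NOT exist in go-redis; it is the
// proposed hook, shown commented out next to the existing OnConnect option.
package main

import (
    "context"
    "log"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()

    client := redis.NewClient(&redis.Options{
        Addr: "localhost:6379",
        // OnConnect exists today and runs once, right after a connection is dialed.
        OnConnect: func(ctx context.Context, cn *redis.Conn) error {
            return cn.Ping(ctx).Err()
        },
        // The proposed hook would run every time an idle connection is taken
        // from the pool, trading extra latency for a health check:
        //
        // OnReuseConnection: func(ctx context.Context, cn *redis.Conn) error {
        //     return cn.Ping(ctx).Err()
        // },
    })

    if err := client.Ping(ctx).Err(); err != nil {
        log.Print(err)
    }
}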
Is this resolved with #1824 being merged?
Is this resolved with #1824 being merged?
We still can't solve this problem perfectly, because the redis-server startup time is uncertain and connecting to it may fail for a short period.
If Redis is still in the process of starting up, dialing a new TCP connection will still return an error, but I think that is normal; we cannot make the client wait for redis-server.
With #1824 we can discard the damaged connection, but the outage caused by a redis-server restart should be handled by the user.
I think we can close this issue; there seems to be no better solution. What does everybody think?
@monkey92t we can and probably will eventually do something like https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816 , but it is not an easy refactoring. Perhaps for v9.
@monkey92t we can and probably will eventually do something like https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816 , but it is not an easy refactoring. Perhaps for v9.
We can improve #1824, example:
func (p *ConnPool) Get(ctx context.Context) (*Conn, error) {
    if err := p.waitTurn(ctx); err != nil {
        return nil, err
    }

    for {
        p.connsMu.Lock()
        cn := p.popIdle()
        p.connsMu.Unlock()

        if cn == nil {
            break
        }

        // If the idle connection is stale, close it and break out of the loop
        // to dial a new connection instead of trying the next pooled one.
        if p.isStaleConn(cn) {
            _ = p.CloseConn(cn)
            break
        }

        atomic.AddUint32(&p.stats.Hits, 1)
        return cn, nil
    }

    atomic.AddUint32(&p.stats.Misses, 1)

    newcn, err := p.newConn(ctx, true)
    if err != nil {
        p.freeTurn()
        return nil, err
    }

    return newcn, nil
}
It may affect some unit tests...
In addition, I believe we have multiple locations that need to be changed. I will try it out and get a feel for its complexity.