go-redis
Retries may be done using broken connections from the pool
Expected Behavior
When restarting the Redis server, due to the default of MaxRetries: 3, I'd expect go-redis to attempt to reconnect to the server and retry the query.
Current Behavior
If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.
Possible Solution
Some options:
- Perform a ping on the connection before reusing it (adds latency if the connection is fine)
- Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
Steps to Reproduce
The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.
package main

import (
    "context"
    "github.com/go-redis/redis/v8"
    "log"
    "net"
    "time"
)

func main() {
    ctx := context.Background()

    // Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
    client := redis.NewClient(&redis.Options{
        Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
            log.Print("Dialer called")
            conn, err := net.Dial(network, addr)
            if err == nil {
                go func() {
                    time.Sleep(5 * time.Second)
                    conn.Close()
                }()
            }
            return conn, err
        },
    })

    // Function to ping the server and log errors if they occur.
    ping := func() {
        _, err := client.Ping(ctx).Result()
        if err != nil {
            log.Print(err)
        }
    }

    // Perform some pings to fill the connection pool.
    for i := 0; i < 10; i++ {
        go ping()
    }

    // Wait for connections to die.
    time.Sleep(10 * time.Second)

    // Perform another ping and log pool stats before and after.
    log.Printf("%#v", client.PoolStats())
    ping()
    log.Printf("%#v", client.PoolStats())
}
Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}
You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.
When adding MaxRetries: 20 to the redis.Options, the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):
2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}
Context (Environment)
I want my application to gracefully handle Redis restarts and had hoped that this was handled automatically by go-redis due to the default of MaxRetries: 3.
- Perform a ping on the connection before reusing it (adds latency if the connection is fine)
- Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
All these options look worse than the current behavior. Even with large pools like 100 connections we are talking only about 100/3 = 33 failed commands.
Could you please explain why exactly you don't prefer them, especially the first one?
Why not check whether a connection is sane before using it "in production"?
Because that means executing a ping command for each command we want to process. That means 2x latency and 2x the number of commands.
And the last suggested solution?
Always use a fresh connection for retries
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
That is the most reasonable option, if you can provide arguments why we should add some special logic to handle retries. As I said, we are talking about 100/3 = 33 failed commands. In practice even that is an exaggeration, because there is a small delay before the command is retried, so it is unlikely that the command will be retried using bad connections.
Maybe I don't get the purpose of the retry functionality in go-redis. I'd expect it to handle situations like a Redis restart where all existing connections fail simultaneously, so that I don't have to handle this in my application code. The current implementation leads to errors returned from go-redis in situations where I would not expect them.
In practice even that is an exaggeration, because there is a small delay before the command is retried, so it is unlikely that the command will be retried using bad connections.
That's only the case if you perform lots of concurrent actions on the pool. If there are situations where only a few goroutines interact with Redis, it is quite likely that you will receive an error from go-redis even though an attempt on a fresh connection would have worked perfectly fine. Reducing the pool size is not an option if the amount of concurrent Redis queries changes over time.
Well... unlikely != impossible. So I'd like to reduce 33 or even 3 to 0.
if you can provide arguments why we should add some special logic to handle retries
Pro: on a Redis restart, commands are not just unlikely but guaranteed not to be retried on bad connections, which increases app stability. Apps also don't have to set max retries to pool size + x, and recovery from a Redis restart doesn't take much longer.
Contra: you possibly re-dial unnecessarily, but if one connection breaks, the others likely do as well (a Redis restart or similar). This should not hurt much because:
- Redis should be operational most of the time
- if you're going to retry, you've already accepted extra latency
Shall I make a PR?
This is what database/sql does - https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816. You will have to
- introduce the equivalent of cachedOrNewConn/alwaysNewConn when retrieving a conn
- introduce ErrBadConn
- refactor the retry loop (in multiple places)
It is going to be non-trivial, but feel free to give it a try :)
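For illustration, here is a minimal, hypothetical sketch of that database/sql-style approach. None of these names (ErrBadConn, connStrategy, cachedOrNewConn, alwaysNewConn, process) exist in go-redis; they only mirror the database/sql pattern linked above: retry on pooled connections a bounded number of times, then make one final attempt on a freshly dialed connection.

// Hypothetical sketch only: these names are not go-redis API, they mirror the
// database/sql pattern (connReuseStrategy + driver.ErrBadConn) linked above.
package pool

import (
    "context"
    "errors"
)

// ErrBadConn would signal that a connection is unusable and the command may be
// retried on a different connection, like database/sql's driver.ErrBadConn.
var ErrBadConn = errors.New("redis: bad connection")

type connStrategy int

const (
    cachedOrNewConn connStrategy = iota // reuse an idle pooled connection if one exists
    alwaysNewConn                       // skip the pool and dial a fresh connection
)

// process stands in for the command retry loop: try pooled connections up to
// maxRetries times, then make one last attempt on a guaranteed-fresh connection.
func process(ctx context.Context, maxRetries int, exec func(ctx context.Context, strategy connStrategy) error) error {
    var err error
    for attempt := 0; attempt <= maxRetries; attempt++ {
        err = exec(ctx, cachedOrNewConn)
        if !errors.Is(err, ErrBadConn) {
            return err // success, or an error that is not connection-related
        }
    }
    return exec(ctx, alwaysNewConn)
}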
And the much simpler suggestion (https://github.com/go-redis/redis/issues/1737#issuecomment-825703168) is not an option for you?
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
database/sql does a similar thing but only as an extra last retry. The implementation cost will not be any different and it will match the database/sql behavior which is a nice thing.
It is going to be non-trivial
Problems aren’t stop signs, but signposts (for me). :)
Hm... the command-processing flow is kinda complex (cmdable -> process -> _process -> withConn -> With* -> ...).
Where would you like to see the check whether the just-returned error is X/Y/Z, along with the returning of ErrBadConn?
I'd rather not just shoot in the dark by myself, as you may say afterwards: err... no.
Or: do we even need that and ErrBadConn itself? Can't we just say retryable error -> bad conn?
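As a rough illustration of the "retryable error -> bad conn" idea, a classifier like the sketch below could decide when to discard a pooled connection. isBadConnError is a made-up name, not go-redis API, and the string check simply matches the error seen in the repro output above.

// Hypothetical sketch: a classifier for the "retryable error -> bad conn" idea.
// isBadConnError is not go-redis API; the names are illustrative only.
package pool

import (
    "errors"
    "io"
    "net"
    "strings"
)

// isBadConnError reports whether err suggests the underlying connection is
// unusable (as opposed to a normal Redis error reply), so the caller should
// discard the connection and retry on a different one.
func isBadConnError(err error) bool {
    if err == nil {
        return false
    }
    if errors.Is(err, io.EOF) {
        return true // server closed the connection, e.g. during a Redis restart
    }
    var netErr net.Error
    if errors.As(err, &netErr) {
        return true // failure at the network layer (dial, read, write, timeout)
    }
    // Matches the "use of closed network connection" error from the repro output.
    return strings.Contains(err.Error(), "use of closed network connection")
}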
FWIW, I can very reliably reproduce this issue on AWS ElastiCache using a small Redis instance with a primary and a replica in non-clustered mode and the default parameter group. Importantly, this sets close-on-replica-write to true. I then have multiple goroutines reading and writing from the Redis server and initiate a failover. Invariably this leads to read-only commands like HKEYS failing with an EOF error, presumably because they are reusing connections that were closed due to the close-on-replica-write setting but left in the pool.
What about adding a user-supplied OnReuseConnection hook, like OnConnect, that is called before a connection is reused from the pool?
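To make the proposal concrete, here is a hedged sketch of what such a hook could look like. OnConnect exists in go-redis v8; OnReuseConnection is purely the commenter's proposed (non-existent) option and is shown commented out.

// Sketch only: OnReuseConnection does NOT exist in go-redis; it is the
// proposed hook, shown commented out next to the existing OnConnect option.
package main

import (
    "context"
    "log"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()

    client := redis.NewClient(&redis.Options{
        Addr: "localhost:6379",
        // OnConnect exists today and runs once, right after a connection is dialed.
        OnConnect: func(ctx context.Context, cn *redis.Conn) error {
            return cn.Ping(ctx).Err()
        },
        // The proposed hook would run every time an idle connection is taken
        // from the pool, trading extra latency for a health check:
        //
        // OnReuseConnection: func(ctx context.Context, cn *redis.Conn) error {
        //     return cn.Ping(ctx).Err()
        // },
    })

    if err := client.Ping(ctx).Err(); err != nil {
        log.Print(err)
    }
}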
Is this resolved with #1824 being merged?
Is this resolved with #1824 being merged?
We still can't solve this problem perfectly, because the redis-server startup time is uncertain and connecting to it may fail for a short period.
If Redis is still in the process of starting up, dialing a new TCP connection will still return an error, but I think that is normal; we cannot make the client wait for redis-server.
With #1824 we can discard the damaged connection, but the outage caused by a redis-server restart should be handled by the user.
I think we can close this issue; there seems to be no better solution. What does everybody think?
@monkey92t we can and probably will eventually do something like https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816 , but it is not an easy refactoring. Perhaps for v9.
@monkey92t we can and probably will eventually do something like https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816 , but it is not an easy refactoring. Perhaps for v9.
We can improve #1824, example:
func (p *ConnPool) Get(ctx context.Context) (*Conn, error) {
    if err := p.waitTurn(ctx); err != nil {
        return nil, err
    }

    for {
        p.connsMu.Lock()
        cn := p.popIdle()
        p.connsMu.Unlock()

        if cn == nil {
            break
        }

        // If the idle connection is stale, close it and break out of the loop
        // to dial a new connection instead of trying the next pooled one.
        if p.isStaleConn(cn) {
            _ = p.CloseConn(cn)
            break
        }

        atomic.AddUint32(&p.stats.Hits, 1)
        return cn, nil
    }

    atomic.AddUint32(&p.stats.Misses, 1)

    newcn, err := p.newConn(ctx, true)
    if err != nil {
        p.freeTurn()
        return nil, err
    }

    return newcn, nil
}
It may affect some unit tests...
In addition, I believe we have multiple locations that need to be changed. I will try it out and get a feel for its complexity.