go-redis

context deadline not respected if backends are down

Open jsha opened this issue 3 years ago • 5 comments

Expected Behavior

When a context deadline is provided and all backends are down, calls to ClusterClient.Get (and other commands) return by the deadline, or shortly after, with an error.

Current Behavior

Calls to ClusterClient.Get block for N * D, where N is the number of backends configured, and D is the DialTimeout.

Steps to Reproduce

package main

import (
	"context"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"203.0.113.91:999",
			"203.0.113.92:999",
			"203.0.113.93:999",
			"203.0.113.94:999",
			"203.0.113.95:999",
			"203.0.113.96:999",
		},
		DialTimeout:  time.Second,
		ReadTimeout:  time.Second,
		WriteTimeout: time.Second,
	})
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	resp, err := rdb.Get(ctx, "example").Result()
	if err != nil {
		log.Fatal(err)
	}
	log.Print(resp)
}
$ time go run main.go
dial error: dial tcp 203.0.113.96:999: i/o timeout
dial error: dial tcp 203.0.113.95:999: i/o timeout
dial error: dial tcp 203.0.113.93:999: i/o timeout
dial error: dial tcp 203.0.113.94:999: i/o timeout
dial error: dial tcp 203.0.113.96:999: i/o timeout
dial error: dial tcp 203.0.113.91:999: i/o timeout
2022/07/06 11:31:31 context deadline exceeded
exit status 1

real    0m6.532s
user    0m0.925s
sys     0m0.281s

Expected: real time is less than 500ms, based on the 100ms context timeout. Actual: real time is 6.5s.

Note: the dial errors are present because I added a Println call to go-redis' dial code to make it clearer what's going on.

Context (Environment)

go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5

This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.

Possible Implementation

It's reasonable to keep trying to dial in the background, but the user-facing call should return based on the deadline.

jsha avatar Jul 06 '22 18:07 jsha

Here's a stack trace of where the dial calls are coming from. Notably one of the stack frames is in cmdsInfo, which loops through the available nodes without checking for a context deadline each time: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1563-L1588

This is particularly a problem because cmdsInfo is called within an internal.Once.Do, which holds a mutex. Since mutexes can't observe context deadlines, all other goroutines block on that same mutex for longer than their intended deadlines.

goroutine 1 [running]:
runtime/debug.Stack()
        /home/jsha/go1.18/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
        /home/jsha/go1.18/src/runtime/debug/stack.go:16 +0x19
github.com/go-redis/redis/v8/internal/pool.(*ConnPool).dialConn(0xc0000b6280, {0x68fe48, 0xc0000a2000}, 0x1)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/internal/pool/pool.go:199 +0xc6
github.com/go-redis/redis/v8/internal/pool.(*ConnPool).newConn(0xc0000b6280, {0x68fe48?, 0xc0000a2000?}, 0x1)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/internal/pool/pool.go:163 +0x6e
github.com/go-redis/redis/v8/internal/pool.(*ConnPool).Get(0xc0000b6280, {0x68fe48, 0xc0000a2000})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/internal/pool/pool.go:279 +0x16b
github.com/go-redis/redis/v8.(*baseClient)._getConn(0xc0000d01a0, {0x68fe48, 0xc0000a2000})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:194 +0x38
github.com/go-redis/redis/v8.(*baseClient).getConn(0xc0000d01a0, {0x68fe48?, 0xc0000a2000?})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:182 +0x69
github.com/go-redis/redis/v8.(*baseClient).withConn(0xc0000d01a0, {0x68fe48?, 0xc0000a2000}, 0xc00009c780)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:274 +0x73
github.com/go-redis/redis/v8.(*baseClient)._process(0xc0000d01a0, {0x68fe48, 0xc0000a2000}, {0x6909d0?, 0xc0000b4190}, 0x0?)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:329 +0x114
github.com/go-redis/redis/v8.(*baseClient).process(0xc0000d01a0, {0x68fe48, 0xc0000a2000}, {0x6909d0, 0xc0000b4190})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:311 +0x79
github.com/go-redis/redis/v8.hooks.process({{0x0?, 0x10?, 0xc0000cf960?}}, {0x68fe48?, 0xc0000a2000?}, {0x6909d0, 0xc0000b4190}, 0x50?)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:54 +0x7a
github.com/go-redis/redis/v8.(*Client).Process(0xc0000cf9a8?, {0x68fe48?, 0xc0000a2000?}, {0x6909d0?, 0xc0000b4190?})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:596 +0x70
github.com/go-redis/redis/v8.cmdable.Command(0xc0000987a0, {0x68fe48?, 0xc0000a2000})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/commands.go:467 +0xd7
github.com/go-redis/redis/v8.(*ClusterClient).cmdsInfo(0xc0000e8000, {0x68fe48, 0xc0000a2000})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/cluster.go:1602 +0x13b
github.com/go-redis/redis/v8.(*cmdsInfoCache).Get.func1()
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/command.go:3341 +0x3f
github.com/go-redis/redis/v8/internal.(*Once).Do(0xc0000d0168, 0xc00009edd0?)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/internal/once.go:54 +0xaa
github.com/go-redis/redis/v8.(*cmdsInfoCache).Get(0xc0000d0160, {0x68fe48?, 0xc0000a2000?})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/command.go:3340 +0x5d
github.com/go-redis/redis/v8.(*ClusterClient).cmdInfo(0xc0000e8000, {0x62fc63, 0x3})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/cluster.go:1618 +0x47
github.com/go-redis/redis/v8.(*ClusterClient).process(0xc0000e8000, {0x68fe80?, 0xc0000b2180}, {0x690f10, 0xc0000b21e0})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/cluster.go:772 +0x75
github.com/go-redis/redis/v8.hooks.process({{0x0?, 0xc000098760?, 0x7ff15901dc40?}}, {0x68fe80?, 0xc0000b2180?}, {0x690f10, 0xc0000b21e0}, 0x60?)
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/redis.go:54 +0x7a
github.com/go-redis/redis/v8.(*ClusterClient).Process(0xc0000cfeb0?, {0x68fe80?, 0xc0000b2180?}, {0x690f10?, 0xc0000b21e0?})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/cluster.go:768 +0x6d
github.com/go-redis/redis/v8.cmdable.Get(0xc000098740, {0x68fe80?, 0xc0000b2180}, {0x630463, 0x7})
        /home/jsha/boulder/vendor/github.com/go-redis/redis/v8/commands.go:786 +0x122
main.main()
        /home/jsha/boulder/redis-lookup/main.go:23 +0x10c

jsha avatar Jul 06 '22 19:07 jsha

@jsha I'm having the same issue with the v8 client: getting context deadline exceeded. Any luck with your issue? I checked the Redis docs and found https://blog.uptrace.dev/posts/go-context-timeout.html, but even with an individual ctx per call, no luck.

ehrktia avatar Dec 15 '22 14:12 ehrktia

@jsha mine got resolved when I corrected the URL and port, changing localhost to 0.0.0.0. In case anyone stumbles on the same problem, this fix is worth trying.

ehrktia avatar Dec 27 '22 20:12 ehrktia

Well, you should try Ping() first before doing operations on Redis. After I set up the client, I usually ping the connection, and if that fails, I panic().

fari-99 avatar Jul 18 '23 07:07 fari-99