go-redis
go-redis copied to clipboard
Constantly Reestablishing Connections to AWS ElastiCache Redis in Cluster Mode (Continued)
(See previous issue for additional context and discussion)
Expected Behavior
A ClusterClient
with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.
Example ClusterClient
config:
redis.NewClusterClient(&redis.ClusterOptions{
Addrs: []string{redisAddr},
TLSConfig: &tls.Config{},
})
Current Behavior
(The Current Behavior section in the original issue is still accurate to this issue)
Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:
- Amount of load on the system (# Redis commands)
- Events happening on the ECRedis cluster (resharding, cycling out nodes, failovers, etc.)
- Any other issues with the ECRedis cluster not normally visible in the AWS Console (we consulted with AWS support for this one)
- Service restarts for our Go service that communicates with ECRedis
When this issue happens, running CLIENT LIST
on the affected Redis node shows age=0
or age=1
for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.
After discussion on the previous issue we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.
Possible Solution
Redis ClusterClient
should react more gracefully to
After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:
- Increasing all timeouts in
context.Context
s passed to Redis commands to 1 second or greater - Reducing # of goroutines exercising our Redis claim logic
- Lowering connection pool size
- Moving all operations out of
MULTI
/EXEC
pipelines - Completely disabling idle connection reaping
- Fuzzing the interval at which each individual cluster node's Client cleans idle connections
- Changing the instance type of the ElastiCache Redis nodes
- Rebalancing/increasing shard count for the ElastiCache Redis cluster
We are seeing the issue a lot less often after
- Reducing # of goroutines exercising our Redis claim logic
- Lowering pool size
- Fuzzing the interval at which each individual cluster node's Client cleans idle connections
but are not confident that these changes constitute a permanent fix and are instead a consequence of there being fewer potential resources to overload an individual ECRedis cluster node.
Details About Fuzzing Idle Connection Cleanup
We opted to set IdleTimeout
and IdleCheckFrequency
on the redis.ClusterOptions
passed to our client to -1
and added a custom NewClient
implementation:
// A new redis.Client is created for each Redis cluster node to which a redis.ClusterClient
// sends commands. Make each redis.Client clean idle connections at a different cadence.
redisOpts.NewClient = func(clientOptions *redis.Options) *redis.Client {
rand.Seed(time.Now().UnixNano())
idleTimeoutFuzz := time.Duration(rand.Int63n(int64(idleTimeoutFuzzRange)))
// It's safe to modify this *redis.Options in this way - a new data
// structure is created and passed to this function for each client created
clientOptions.IdleTimeout = baseIdleTimeout + idleTimeoutFuzz
return redis.NewClient(clientOptions)
}
Resetting the System to a Good State
When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes SRANDMEMBER pending 10
-> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement from our solution in the previous issue: completely stopping all services and then turning them back on after some period of time. However, it's not a tenable solution to this issue if we are to continue onboarding more work to the system in production.
Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provide a fairly complete summary of how we're using both go-redis
and ECRedis. We've not been able to consistently trigger this issue since it often happens when we're not load testing. I'd be interested to know if the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing the issue. It sounds like it's been difficult to reproduce for them as well.
Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:
-
SRANDMEMBER pending 10
- grab up to 10 random items from the pool of available work -
ZADD in_progress <current_timestamp> <grabbed_item>
for each of our items we got in the previous step - Any work items we weren't able to
ZADD
have been claimed by some other instance of the service, skip them - Once we're done with a work item,
SREM pending <grabbed_item>
- Periodically
ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago>
so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window
Work item producers simply SADD pending <work_item>
but the load they produce is miniscule compared to the consumers running the algorithm outlined above.
Currently we run this algorithm on 22 EC2 instances, each running one service. We've configured our ClusterClient
to use a PoolSize
of 10
(reduced from the default of 20
on these instances) for each child Client
it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.
At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.
Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using github.com/go-redis/redis/v8 v8.0.0
, github.com/go-redis/redis/v8 v8.11.2
, and github.com/go-redis/redis/v8 v8.11.4
- all 3 have run into this issue. We are now using a fork of v8.11.4
(see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):
- Add logs in spots where bad connections are removed
- Add hooks in
ClusterClient.process
method for logging/metrics around various anomalies during processing commands (MOVED
/ASK
responses, errors getting nodes, node failures). We do not see these while the issue is happening.
We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of cache.r6g.large
instances.
Detailed Description
N/A
Possible Implementation
N/A
@enjmusic thanks for creating the isssue.
Add logs in spots where bad connections are removed
Do you see any logs? I guess not which may indicate that the idle conns reaper is indeed the problem. You could try to disable it completely to confirm that. Or add some logging here and here if you really need IdleTimeout
.
@vmihailenco if I remember correctly, we were seeing a mixture of I/O timeout and context deadline exceeded logs when this issue was triggered.
We did completely disable idle connection reaping for a while (both the reaper goroutine and the IdleTimeout
) and while this made the issue happen less often, we did still see it triggered once! So while it may be a contributing factor I don't know if it's the root cause. We haven't seen this issue lately after:
- Reducing # of goroutines exercising our Redis claim logic
- Lowering pool size
- Fuzzing the interval at which each individual cluster node's Client cleans idle connections
but again, it's difficult to believe that we've completely addressed the issue or just made it exceedingly unlikely at our current scale. I'd be interested to know if we can get any more insights from @klakekent and @pedrocunha, assuming they're still seeing the issue periodically.
I/O timeout and context deadline
Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.
Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.
Our web servers running go-redis
code, as well as our ElastiCache Redis cluster, are both overprovisioned for the amount of work they're doing. They're always pretty bored when this issue is triggered. Given that:
- we have a ton of CPU/memory headroom
- multiple others have reported the same issue
- we use
go-redis
in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts - @klakekent, like us, was told by AWS support that there was nothing suspicious going on from AWS's view of the ElastiCache Redis clusters
With these points in mind, I think it's very reasonable to consider them enough evidence to continue taking a closer look at go-redis
. I would love to see more engagement around getting to the bottom of this issue rather than dismissing it a second time.
According to your log report closing bad conn: context deadline exceeded
, I checked the working process of go-redis in detail, it only appears in initConn
and withConn
, I believe this error is caused by context timeout .
But according to your reply:
- we use
go-redis
in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts
You did not encounter this problem in non-redis cluster mode, so what is your context timeout in non-redis cluster mode? Is an ACL operation required(initConn
)?
skip initConn:
if c.opt.Password == "" && c.opt.DB == 0 && !c.opt.readOnly && c.opt.OnConnect == nil {
// skip initConn
}
go-redis ClusterClient
additionally caches the results of cluster slots
, but it is executed asynchronously (except when each node is initialized), but if your redis-cluster nodes change frequently, it may greatly increase the command execution time (ask moved
).
In addition, you may need to consider network issues (aws vpc), or the time spent processing commands by redis-cluster(> context.Timeout).
So, we have a theory about what might be going on. AWS ElastiCache Redis best practices dictate that one should use the "Cluster Configuration Endpoint" for cluster discovery, which we took to mean that it was ok to use that as our one seed address for the ClusterClient
, assuming that every time it was reached out to it would route to a different node in the cluster. The auto discovery section of the ElastiCache FAQ states that
the Amazon ElastiCache cluster is also given a unique Configuration Endpoint which is a DNS Record that is valid for the lifetime of the cluster. This DNS Record contains the DNS Names of the nodes that belong to the cluster. Amazon ElastiCache will ensure that the Configuration Endpoint always points to at least one such “target” node. A query to the target node then returns endpoints for all the nodes of the cluster in question
which we took to mean that it would balance between all nodes. However, there's likely some DNS caching going on. When I nslookup
our Configuration Endpoint I see the full list of IPs for all primaries and replicas, but when I openssl s_client -nocommands -connect <configuration_endpoint>
I always end up on the same node in the cluster. I think this means that when the ClusterClient
is seeded with just the Configuration Endpoint, in any situation where several goroutines executing code in cluster.go
all need to fall back on the initial Addrs
list provided when the ClusterClient
was created, they'll all hammer the same shard.
We may be able to get around this issue by performing a LookupIP
or similar query to get our Addrs
list so that the initial list of nodes lends itself well to a fair division of lookups across nodes when needed.
We're probably going to implement this safety measure no matter what, but my main question after looking into the Configuration Endpoint (and its use as the only seed address for a ClusterClient
) a little more is: what are the different spots in cluster.go
where we fall back on the initially provided Addrs
list from the ClusterOptions
? I started looking deeply into the code for this today but figured I'd miss some stuff given that there's a lot of code for clusterNodes
and clusterStateHolder
.
There seems to be no evidence of a DNS problem, in the logs you see "context deadline exceeded", we should determine where more time is being consumed causing the context to time out.
We were unable to reproduce this issue in a test environment, you can add more logs to go-redis to locate where time is consuming:
https://github.com/go-redis/redis/blob/40e84a79bfb4fafc2d7b6770f7434311de08ce48/cluster.go#L771
We have a theory about what might be going on... The auto discovery section...
@enjmusic Have you seen this post https://medium.com/tinder-engineering/taming-elasticache-with-auto-discovery-at-scale-dc5e7c4c9ad0 ?
Tinder found a bug in their Java client failing to dynamically reconfigure which IP it was connecting to following AWS Elasticache cluster maintenance events.
We finally saw this issue again during the night yesterday, but it affected eight different nodes, which we've never seen before. Previously it was always one. This may or may not disprove the cluster configuration endpoint issue. Again, temporarily stopping new ingress of jobs into Redis resolved the issue.
@dzqv just read the blog post! It's an interesting study but doesn't look like our issue where connection floods occur to some nodes independent of any cluster maintenance events
This time we did see the issue happen within a couple minutes of a
Certificate renewed for replication group <REDACTED> on Sun Mar 20 08:14:05 UTC 2022
... curious about whether or not that could put clients into a bad state
We still don't know what caused this but a couple of things:
- This only happened for a particular service that uses context timeouts
- We have not seen this issue re-occur since we upgraded to Redis 6.2
- On the online upgrade to 6.2 there are multiple failover events and we didn't experience any issues in regards to new connections
- Our limiter callback seems to trigger frequently but we aren't sure if this is "saving" us
Just wanna add that we are also seeing this issue. We are using clustered Elasticache Redis 6.0.5 on r6g instances with TLS enabled, our clients are running on EC2 as well.
We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.
A few other params we also tuned to mitigate this issue:
- lowering poolSize, and also make number of connections stable by setting
PoolSize = MinIdleConns
- disabling idle conn reap
- add random jitter in dialer
- stick to higher timeouts, set dialTimeout = 5s; readTimeout = writeTimeout = PoolTimeout = 1s
- disabling retry by setting
MaxRetries
to -1
The changes we made are helpful to reduce the symptom, but it's still happening from time to time. Almost always it started with read timeouts on a single node, and then a huge TLS connection spike to that node. It issue happens and self-resolves within a minutes
@pedrocunha Are there any re-occurrences on Elasticache Redis 6.2 since you posted? https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/supported-engine-versions.html#redis-version-6.2 Their 6.2 version added multi-cpu optimization for TLS connections, so maybe that's helping.
Today we saw this issue recur despite our Redis cluster being on
Engine Version Compatibility: 6.2.5
Not sure if that means the nodes are actually running 6.2 or just something to do with which commands work.
That said, we now have a circuit breaker implemented in our writers that stops writes when errors are detected for given keys, which has helped with autoremediation.
We had this new logging in place on our go-redis
fork for cluster state issues and the like, but didn't see any of it trigger during the issue.
Actually, our nodes are cache.r6g.large
right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!
We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.
This article should explain why that helps.
Are there other good reasons to allow a context to be passed to Redis commands at all in go-redis
since passing a deadline/timeout context is considered an antipattern according to this guidance?
Yes, tracing requires context. Besides, it is not an antipattern - it is just hard to use context timeouts safely.
But, perhaps, we should add an option to ignore context.Context
timeout. Or just ignore context.Context
when ReadTimeout/WriteTimeout are set.
I tried to lower the pool size from 1300 to 800, and go-redis seems to work stable for about 2 weeks.
We had one occurrence of this issue with r6g.xlarge and 6.2. Seems that tweaking the limiter to be more aggressive and reducing the pool size has helped us. We haven't had issues for nearly 3 weeks now. We haven't got rid of context deadline yet but it looks like a sensible thing to do, however in our case we would want these to be extremely small (< 200ms) as we rather still return something over waiting and clogging all the backend requests.
@enjmusic
Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!
I understand wanting more than 1 vCPU but large has 2 vCPUs. Is there anything in particular of 2 vs many more vCPU in this case?
But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.
👍 on the latter
After experiencing issues like this we have a theory that when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right. Possibly half creating a new connection and immediately giving up.
I got this issue again yesterday, but this time i found the error message was "redis: connection pool timeout",and tried to check the metrics in ElastiCache console and find at that the new connection suddenly went to 1.6k, so seems only to decrease the pool size not fix this issue.
Anything new ? we started to experience this too on non managed redis cluster, random times. Seems like related to context timeouts. We have set a pretty big one to try to mitigate it (>5s), but it still happens occasionally. @vmihailenco maybe connections should not be closed on timeouts ? and only on unrecoverable errors ? or at least allowing this to be configured.
Hi, guys!
I was able to reproduce the issue.
Env
Redis 6.2.5 go-redis v8.11.4 go 1.17 macOs BigSur 11.6.3
Steps to reproduce:
- docker run -p 6379:6379 redis
- execute the following code:
package main
import (
"context"
"fmt"
"sync"
"time"
"github.com/go-redis/redis/v8"
)
func main() {
rc := redis.NewRing(&redis.RingOptions{
Addrs: map[string]string{"0": "localhost:6379"},
DB: 0,
MaxRetries: -1,
DialTimeout: 1 * time.Second,
ReadTimeout: 50 * time.Millisecond, // ReadTimeout should be lower then context timeout
PoolSize: 30,
MinIdleConns: 30,
IdleTimeout: -1,
IdleCheckFrequency: -1,
})
wg := sync.WaitGroup{}
for i := 0; i < 1000; i++ {
wg.Add(1)
go func(no int) {
defer wg.Done()
ctx := context.Background()
// comment out next 2 lines of code to resolve timeout problem
ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
defer cancel()
err := rc.Get(ctx, "foo").Err()
fmt.Printf("%d: %s\n", no, err)
}(i)
}
wg.Wait()
}
I also reproduced "use of closed network connection" problem with Redis pipeline:
package main
import (
"context"
"fmt"
"sync"
"time"
"github.com/go-redis/redis/v8"
)
func main() {
rc := redis.NewRing(&redis.RingOptions{
Addrs: map[string]string{"0": "localhost:6379"},
DB: 0,
MaxRetries: -1,
DialTimeout: 1 * time.Second,
ReadTimeout: 35 * time.Millisecond, // ReadTimeout should be lower then context timeout
PoolSize: 50,
MinIdleConns: 50,
IdleTimeout: -1,
IdleCheckFrequency: -1,
})
wg := sync.WaitGroup{}
for i := 0; i < 1000; i++ {
wg.Add(1)
go func(no int) {
defer wg.Done()
ctx := context.Background()
// comment out next 2 lines of code to resolve timeout problem
ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
defer cancel()
pipe := rc.Pipeline()
pipe.Get(ctx, "foo1")
pipe.Get(ctx, "foo2")
pipe.Get(ctx, "foo3")
_, err := pipe.Exec(ctx)
fmt.Printf("%d: %s\n", no, err)
}(i)
}
wg.Wait()
}
In the second example error with message "use of closed network connection" happens only with pipeline.
It seams that passing the context without timeout (as proposed here) works
As you can see ReadTimeout is lower than context timeout. When I do not set context timeout and use only ReadTimeout it works fine, but as far as I set context timeout I face timeouts and reuse of closed connection (in case of pipeline).
@yoni-veksler-px
maybe connections should not be closed on timeouts ?
I also think that connections should not be closed on timeouts.
@pete-woods
when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right.
Yes, it may be true.
@vmihailenco can you recheck my examples?
Are there any updates here? We're experiencing exactly the same issues others have discussed here. We get spiking new connections that cause the Redis CPU to spike on at least one node. From there, everything goes to hell for a few minutes until the cluster recovers on its own.
We have tried tuning a lot of our config options and are currently running with a pool size of 64, but when we start up our service, we see a spike in connections that makes no sense given our settings. Immediately on start up, we see new connections spike to 10x what we'd expect when doing the math of pods X pool size.
This, in turn, causes issues as kubernetes rolls out more pods throughout the day as our traffic increases.
I updated our calls to go-redis
to use a context.Background()
instead of the context that we typically pass through, but we still see connection pool timeout
errors.
For those who face the issue and perform primarily pipelined operations I would recommend giving github.com/joomcode/redispipe a shot which gives you a lot of freedom with no connection pool to worry about:
All known Golang redis connectors use a connection-per-request model with a connection pool, and provide only explicit pipelining.
This connector was created as implicitly pipelined from the ground up to achieve maximum performance in a highly concurrent environment. It writes all requests to single connection to redis, and continuously reads answers from another goroutine.
we are also experiencing a similar issue. We use Cluster client with no context timeouts. Apparently the cause is the increasing amount of new connections which takes elasticache CPU super high