go-redis icon indicating copy to clipboard operation
go-redis copied to clipboard

Constantly Reestablishing Connections to AWS ElastiCache Redis in Cluster Mode (Continued)

Open ianjhoffman opened this issue 2 years ago • 39 comments

(See previous issue for additional context and discussion)

Expected Behavior

A ClusterClient with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.

Example ClusterClient config:

redis.NewClusterClient(&redis.ClusterOptions{
    Addrs: []string{redisAddr},
    TLSConfig: &tls.Config{},
})

Current Behavior

(The Current Behavior section in the original issue is still accurate to this issue)

Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:

  • Amount of load on the system (# Redis commands)
  • Events happening on the ECRedis cluster (resharding, cycling out nodes, failovers, etc.)
  • Any other issues with the ECRedis cluster not normally visible in the AWS Console (we consulted with AWS support for this one)
  • Service restarts for our Go service that communicates with ECRedis

When this issue happens, running CLIENT LIST on the affected Redis node shows age=0 or age=1 for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.

After discussion on the previous issue we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.

Possible Solution

Redis ClusterClient should react more gracefully to that quickly devolves into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.

After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:

  • Increasing all timeouts in context.Contexts passed to Redis commands to 1 second or greater
  • Reducing # of goroutines exercising our Redis claim logic
  • Lowering connection pool size
  • Moving all operations out of MULTI/EXEC pipelines
  • Completely disabling idle connection reaping
  • Fuzzing the interval at which each individual cluster node's Client cleans idle connections
  • Changing the instance type of the ElastiCache Redis nodes
  • Rebalancing/increasing shard count for the ElastiCache Redis cluster

We are seeing the issue a lot less often after

  • Reducing # of goroutines exercising our Redis claim logic
  • Lowering pool size
  • Fuzzing the interval at which each individual cluster node's Client cleans idle connections

but are not confident that these changes constitute a permanent fix and are instead a consequence of there being fewer potential resources to overload an individual ECRedis cluster node.

Details About Fuzzing Idle Connection Cleanup

We opted to set IdleTimeout and IdleCheckFrequency on the redis.ClusterOptions passed to our client to -1 and added a custom NewClient implementation:

// A new redis.Client is created for each Redis cluster node to which a redis.ClusterClient
// sends commands. Make each redis.Client clean idle connections at a different cadence.
redisOpts.NewClient = func(clientOptions *redis.Options) *redis.Client {
	rand.Seed(time.Now().UnixNano())
	idleTimeoutFuzz := time.Duration(rand.Int63n(int64(idleTimeoutFuzzRange)))

	// It's safe to modify this *redis.Options in this way - a new data
	// structure is created and passed to this function for each client created
	clientOptions.IdleTimeout = baseIdleTimeout + idleTimeoutFuzz

	return redis.NewClient(clientOptions)
}

Resetting the System to a Good State

When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes SRANDMEMBER pending 10 -> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement from our solution in the previous issue: completely stopping all services and then turning them back on after some period of time. However, it's not a tenable solution to this issue if we are to continue onboarding more work to the system in production.

Steps to Reproduce

The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provide a fairly complete summary of how we're using both go-redis and ECRedis. We've not been able to consistently trigger this issue since it often happens when we're not load testing. I'd be interested to know if the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing the issue. It sounds like it's been difficult to reproduce for them as well.

Context (Environment)

We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:

  • SRANDMEMBER pending 10 - grab up to 10 random items from the pool of available work
  • ZADD in_progress <current_timestamp> <grabbed_item> for each of our items we got in the previous step
  • Any work items we weren't able to ZADD have been claimed by some other instance of the service, skip them
  • Once we're done with a work item, SREM pending <grabbed_item>
  • Periodically ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago> so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window

Work item producers simply SADD pending <work_item> but the load they produce is miniscule compared to the consumers running the algorithm outlined above.

Currently we run this algorithm on 22 EC2 instances, each running one service. We've configured our ClusterClient to use a PoolSize of 10 (reduced from the default of 20 on these instances) for each child Client it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.

At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.

Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using github.com/go-redis/redis/v8 v8.0.0, github.com/go-redis/redis/v8 v8.11.2, and github.com/go-redis/redis/v8 v8.11.4 - all 3 have run into this issue. We are now using a fork of v8.11.4 (see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):

  • Add logs in spots where bad connections are removed
  • Add hooks in ClusterClient.process method for logging/metrics around various anomalies during processing commands (MOVED/ASK responses, errors getting nodes, node failures). We do not see these while the issue is happening.

We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of cache.r6g.large instances.

Detailed Description

N/A

Possible Implementation

N/A

ianjhoffman avatar Mar 14 '22 18:03 ianjhoffman

@enjmusic thanks for creating the isssue.

Add logs in spots where bad connections are removed

Do you see any logs? I guess not which may indicate that the idle conns reaper is indeed the problem. You could try to disable it completely to confirm that. Or add some logging here and here if you really need IdleTimeout.

vmihailenco avatar Mar 15 '22 07:03 vmihailenco

@vmihailenco if I remember correctly, we were seeing a mixture of I/O timeout and context deadline exceeded logs when this issue was triggered.

We did completely disable idle connection reaping for a while (both the reaper goroutine and the IdleTimeout) and while this made the issue happen less often, we did still see it triggered once! So while it may be a contributing factor I don't know if it's the root cause. We haven't seen this issue lately after:

  • Reducing # of goroutines exercising our Redis claim logic
  • Lowering pool size
  • Fuzzing the interval at which each individual cluster node's Client cleans idle connections

but again, it's difficult to believe that we've completely addressed the issue or just made it exceedingly unlikely at our current scale. I'd be interested to know if we can get any more insights from @klakekent and @pedrocunha, assuming they're still seeing the issue periodically.

ianjhoffman avatar Mar 15 '22 18:03 ianjhoffman

I/O timeout and context deadline

Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.

vmihailenco avatar Mar 17 '22 13:03 vmihailenco

Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.

Our web servers running go-redis code, as well as our ElastiCache Redis cluster, are both overprovisioned for the amount of work they're doing. They're always pretty bored when this issue is triggered. Given that:

  • we have a ton of CPU/memory headroom
  • multiple others have reported the same issue
  • we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts
  • @klakekent, like us, was told by AWS support that there was nothing suspicious going on from AWS's view of the ElastiCache Redis clusters

With these points in mind, I think it's very reasonable to consider them enough evidence to continue taking a closer look at go-redis. I would love to see more engagement around getting to the bottom of this issue rather than dismissing it a second time.

ianjhoffman avatar Mar 17 '22 21:03 ianjhoffman

According to your log report closing bad conn: context deadline exceeded, I checked the working process of go-redis in detail, it only appears in initConn and withConn, I believe this error is caused by context timeout .

But according to your reply:

  • we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts

You did not encounter this problem in non-redis cluster mode, so what is your context timeout in non-redis cluster mode? Is an ACL operation required(initConn)?

skip initConn:

if c.opt.Password == "" && c.opt.DB == 0 && !c.opt.readOnly && c.opt.OnConnect == nil {
        // skip initConn
}

monkey92t avatar Mar 18 '22 07:03 monkey92t

go-redis ClusterClient additionally caches the results of cluster slots, but it is executed asynchronously (except when each node is initialized), but if your redis-cluster nodes change frequently, it may greatly increase the command execution time (ask moved).

In addition, you may need to consider network issues (aws vpc), or the time spent processing commands by redis-cluster(> context.Timeout).

monkey92t avatar Mar 18 '22 08:03 monkey92t

So, we have a theory about what might be going on. AWS ElastiCache Redis best practices dictate that one should use the "Cluster Configuration Endpoint" for cluster discovery, which we took to mean that it was ok to use that as our one seed address for the ClusterClient, assuming that every time it was reached out to it would route to a different node in the cluster. The auto discovery section of the ElastiCache FAQ states that

the Amazon ElastiCache cluster is also given a unique Configuration Endpoint which is a DNS Record that is valid for the lifetime of the cluster. This DNS Record contains the DNS Names of the nodes that belong to the cluster. Amazon ElastiCache will ensure that the Configuration Endpoint always points to at least one such “target” node. A query to the target node then returns endpoints for all the nodes of the cluster in question

which we took to mean that it would balance between all nodes. However, there's likely some DNS caching going on. When I nslookup our Configuration Endpoint I see the full list of IPs for all primaries and replicas, but when I openssl s_client -nocommands -connect <configuration_endpoint> I always end up on the same node in the cluster. I think this means that when the ClusterClient is seeded with just the Configuration Endpoint, in any situation where several goroutines executing code in cluster.go all need to fall back on the initial Addrs list provided when the ClusterClient was created, they'll all hammer the same shard.

We may be able to get around this issue by performing a LookupIP or similar query to get our Addrs list so that the initial list of nodes lends itself well to a fair division of lookups across nodes when needed.

We're probably going to implement this safety measure no matter what, but my main question after looking into the Configuration Endpoint (and its use as the only seed address for a ClusterClient) a little more is: what are the different spots in cluster.go where we fall back on the initially provided Addrs list from the ClusterOptions? I started looking deeply into the code for this today but figured I'd miss some stuff given that there's a lot of code for clusterNodes and clusterStateHolder.

ianjhoffman avatar Mar 18 '22 17:03 ianjhoffman

There seems to be no evidence of a DNS problem, in the logs you see "context deadline exceeded", we should determine where more time is being consumed causing the context to time out.

We were unable to reproduce this issue in a test environment, you can add more logs to go-redis to locate where time is consuming:

https://github.com/go-redis/redis/blob/40e84a79bfb4fafc2d7b6770f7434311de08ce48/cluster.go#L771

monkey92t avatar Mar 19 '22 03:03 monkey92t

We have a theory about what might be going on... The auto discovery section...

@enjmusic Have you seen this post https://medium.com/tinder-engineering/taming-elasticache-with-auto-discovery-at-scale-dc5e7c4c9ad0 ?

Tinder found a bug in their Java client failing to dynamically reconfigure which IP it was connecting to following AWS Elasticache cluster maintenance events.

dzqv avatar Mar 20 '22 16:03 dzqv

We finally saw this issue again during the night yesterday, but it affected eight different nodes, which we've never seen before. Previously it was always one. This may or may not disprove the cluster configuration endpoint issue. Again, temporarily stopping new ingress of jobs into Redis resolved the issue.

ianjhoffman avatar Mar 21 '22 17:03 ianjhoffman

@dzqv just read the blog post! It's an interesting study but doesn't look like our issue where connection floods occur to some nodes independent of any cluster maintenance events

ianjhoffman avatar Mar 21 '22 18:03 ianjhoffman

This time we did see the issue happen within a couple minutes of a

Certificate renewed for replication group <REDACTED> on Sun Mar 20 08:14:05 UTC 2022

... curious about whether or not that could put clients into a bad state

ianjhoffman avatar Mar 21 '22 18:03 ianjhoffman

We still don't know what caused this but a couple of things:

  1. This only happened for a particular service that uses context timeouts
  2. We have not seen this issue re-occur since we upgraded to Redis 6.2
  3. On the online upgrade to 6.2 there are multiple failover events and we didn't experience any issues in regards to new connections
  4. Our limiter callback seems to trigger frequently but we aren't sure if this is "saving" us

pedrocunha avatar Mar 24 '22 12:03 pedrocunha

Just wanna add that we are also seeing this issue. We are using clustered Elasticache Redis 6.0.5 on r6g instances with TLS enabled, our clients are running on EC2 as well.

We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.

A few other params we also tuned to mitigate this issue:

  1. lowering poolSize, and also make number of connections stable by setting PoolSize = MinIdleConns
  2. disabling idle conn reap
  3. add random jitter in dialer
  4. stick to higher timeouts, set dialTimeout = 5s; readTimeout = writeTimeout = PoolTimeout = 1s
  5. disabling retry by setting MaxRetries to -1

The changes we made are helpful to reduce the symptom, but it's still happening from time to time. Almost always it started with read timeouts on a single node, and then a huge TLS connection spike to that node. It issue happens and self-resolves within a minutes

@pedrocunha Are there any re-occurrences on Elasticache Redis 6.2 since you posted? https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/supported-engine-versions.html#redis-version-6.2 Their 6.2 version added multi-cpu optimization for TLS connections, so maybe that's helping.

allanbenW avatar Mar 30 '22 19:03 allanbenW

Today we saw this issue recur despite our Redis cluster being on

Engine Version Compatibility: 6.2.5

Not sure if that means the nodes are actually running 6.2 or just something to do with which commands work.

That said, we now have a circuit breaker implemented in our writers that stops writes when errors are detected for given keys, which has helped with autoremediation.

ianjhoffman avatar Mar 30 '22 19:03 ianjhoffman

We had this new logging in place on our go-redis fork for cluster state issues and the like, but didn't see any of it trigger during the issue.

ianjhoffman avatar Mar 30 '22 19:03 ianjhoffman

Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!

ianjhoffman avatar Mar 30 '22 20:03 ianjhoffman

We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.

This article should explain why that helps.

vmihailenco avatar Mar 31 '22 14:03 vmihailenco

Are there other good reasons to allow a context to be passed to Redis commands at all in go-redis since passing a deadline/timeout context is considered an antipattern according to this guidance?

ianjhoffman avatar Mar 31 '22 17:03 ianjhoffman

Yes, tracing requires context. Besides, it is not an antipattern - it is just hard to use context timeouts safely.

vmihailenco avatar Apr 01 '22 08:04 vmihailenco

But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.

vmihailenco avatar Apr 01 '22 08:04 vmihailenco

I tried to lower the pool size from 1300 to 800, and go-redis seems to work stable for about 2 weeks.

klakekent avatar Apr 08 '22 05:04 klakekent

We had one occurrence of this issue with r6g.xlarge and 6.2. Seems that tweaking the limiter to be more aggressive and reducing the pool size has helped us. We haven't had issues for nearly 3 weeks now. We haven't got rid of context deadline yet but it looks like a sensible thing to do, however in our case we would want these to be extremely small (< 200ms) as we rather still return something over waiting and clogging all the backend requests.

@enjmusic

Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!

I understand wanting more than 1 vCPU but large has 2 vCPUs. Is there anything in particular of 2 vs many more vCPU in this case?

But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.

👍 on the latter

pedrocunha avatar Apr 11 '22 14:04 pedrocunha

After experiencing issues like this we have a theory that when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right. Possibly half creating a new connection and immediately giving up.

pete-woods avatar Apr 26 '22 06:04 pete-woods

I got this issue again yesterday, but this time i found the error message was "redis: connection pool timeout",and tried to check the metrics in ElastiCache console and find at that the new connection suddenly went to 1.6k, so seems only to decrease the pool size not fix this issue.

klakekent avatar May 11 '22 01:05 klakekent

Anything new ? we started to experience this too on non managed redis cluster, random times. Seems like related to context timeouts. We have set a pretty big one to try to mitigate it (>5s), but it still happens occasionally. @vmihailenco maybe connections should not be closed on timeouts ? and only on unrecoverable errors ? or at least allowing this to be configured.

yoni-veksler avatar Jun 07 '22 21:06 yoni-veksler

Hi, guys!

I was able to reproduce the issue.

Env

Redis 6.2.5 go-redis v8.11.4 go 1.17 macOs BigSur 11.6.3

Steps to reproduce:

  1. docker run -p 6379:6379 redis
  2. execute the following code:
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rc := redis.NewRing(&redis.RingOptions{
		Addrs:              map[string]string{"0": "localhost:6379"},
		DB:                 0,
		MaxRetries:         -1,
		DialTimeout:        1 * time.Second,
		ReadTimeout:        50 * time.Millisecond, // ReadTimeout should be lower then context timeout
		PoolSize:           30,
		MinIdleConns:       30,
		IdleTimeout:        -1,
		IdleCheckFrequency: -1,
	})

	wg := sync.WaitGroup{}

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(no int) {
			defer wg.Done()

			ctx := context.Background()
			// comment out next 2 lines of code to resolve timeout problem
			ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
			defer cancel()

			err := rc.Get(ctx, "foo").Err()

			fmt.Printf("%d: %s\n", no, err)
		}(i)
	}

	wg.Wait()
}

I also reproduced "use of closed network connection" problem with Redis pipeline:

package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rc := redis.NewRing(&redis.RingOptions{
		Addrs:              map[string]string{"0": "localhost:6379"},
		DB:                 0,
		MaxRetries:         -1,
		DialTimeout:        1 * time.Second,
		ReadTimeout:        35 * time.Millisecond, // ReadTimeout should be lower then context timeout
		PoolSize:           50,
		MinIdleConns:       50,
		IdleTimeout:        -1,
		IdleCheckFrequency: -1,
	})

	wg := sync.WaitGroup{}

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(no int) {
			defer wg.Done()

			ctx := context.Background()
			// comment out next 2 lines of code to resolve timeout problem
			ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
			defer cancel()

			pipe := rc.Pipeline()

			pipe.Get(ctx, "foo1")
			pipe.Get(ctx, "foo2")
			pipe.Get(ctx, "foo3")

			_, err := pipe.Exec(ctx)

			fmt.Printf("%d: %s\n", no, err)
		}(i)
	}

	wg.Wait()
}

In the second example error with message "use of closed network connection" happens only with pipeline.

It seams that passing the context without timeout (as proposed here) works

As you can see ReadTimeout is lower than context timeout. When I do not set context timeout and use only ReadTimeout it works fine, but as far as I set context timeout I face timeouts and reuse of closed connection (in case of pipeline).

@yoni-veksler-px

maybe connections should not be closed on timeouts ?

I also think that connections should not be closed on timeouts.

@pete-woods

when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right.

Yes, it may be true.

@vmihailenco can you recheck my examples?

MEZk avatar Jun 21 '22 21:06 MEZk

Are there any updates here? We're experiencing exactly the same issues others have discussed here. We get spiking new connections that cause the Redis CPU to spike on at least one node. From there, everything goes to hell for a few minutes until the cluster recovers on its own.

We have tried tuning a lot of our config options and are currently running with a pool size of 64, but when we start up our service, we see a spike in connections that makes no sense given our settings. Immediately on start up, we see new connections spike to 10x what we'd expect when doing the math of pods X pool size.

This, in turn, causes issues as kubernetes rolls out more pods throughout the day as our traffic increases.

I updated our calls to go-redis to use a context.Background() instead of the context that we typically pass through, but we still see connection pool timeout errors.

ameliagapin avatar Jan 31 '23 18:01 ameliagapin

For those who face the issue and perform primarily pipelined operations I would recommend giving github.com/joomcode/redispipe a shot which gives you a lot of freedom with no connection pool to worry about:

All known Golang redis connectors use a connection-per-request model with a connection pool, and provide only explicit pipelining.

This connector was created as implicitly pipelined from the ground up to achieve maximum performance in a highly concurrent environment. It writes all requests to single connection to redis, and continuously reads answers from another goroutine.

konradreiche avatar Feb 17 '23 23:02 konradreiche

we are also experiencing a similar issue. We use Cluster client with no context timeouts. Apparently the cause is the increasing amount of new connections which takes elasticache CPU super high

juancresc avatar Mar 29 '23 16:03 juancresc