Unable to reconnect to Redis Cluster in Kubernetes after cluster rollout

Open zombiezen opened this issue 2 years ago • 4 comments

Thanks for the library! We're trying to use it in a server running in Kubernetes that connects to a Redis Cluster running in the same Kubernetes cluster. We connect through a RedisPool, using redis-cluster://my-redis-cluster:6379 as the RedisConfig.

We noticed that when we roll out a new version of the Redis deployment, our fred-using server hangs when it tries to send commands to the Redis Cluster. The logs we're seeing are of the form:

Error creating or using backchannel for cluster nodes: Redis Error - kind: IO, details: Os { code: 110, kind: TimedOut, message: "Connection timed out" }
Failed to reconnect with error Redis Error - kind: Cluster, details: Failed to read cluster nodes on all possible backchannel servers.

If we restart our fred-using server with the same configuration, it connects to the Redis Cluster just fine.

After adding some logging, it seems that the cluster's IP addresses change to different ones during the rollout, but IIUC fred doesn't re-resolve the DNS name when it tries to reconnect to the cluster.
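To illustrate the behavior being asked for, here is a minimal sketch that uses only the Rust standard library and is not fred's actual code. It looks the hostname up through the OS resolver before every reconnect attempt, rather than reusing the addresses from the first connection. The names `resolve_fresh` and `reconnect` are hypothetical.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Look up `host:port` through the OS resolver on every call, so addresses
/// that changed after a rollout are picked up. (Hypothetical helper.)
fn resolve_fresh(host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
    Ok((host, port).to_socket_addrs()?.collect())
}

/// Try to reconnect up to `attempts` times. The hostname is re-resolved
/// before each attempt instead of reusing previously cached IP addresses.
fn reconnect(
    host: &str,
    port: u16,
    attempts: u32,
    timeout: Duration,
) -> io::Result<TcpStream> {
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for _ in 0..attempts {
        match resolve_fresh(host, port) {
            Ok(addrs) => {
                // Try every address returned by this lookup.
                for addr in addrs {
                    match TcpStream::connect_timeout(&addr, timeout) {
                        Ok(stream) => return Ok(stream),
                        Err(e) => last_err = e,
                    }
                }
            }
            // The lookup itself failed; remember the error and try again.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

A call such as `reconnect("my-redis-cluster", 6379, 5, Duration::from_secs(2))` would then follow the Kubernetes service name to whatever pods currently back it. A real client would presumably also wait between attempts and re-read the cluster topology after connecting.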

zombiezen avatar Aug 12 '22 22:08 zombiezen

Yeah, good find @zombiezen. This is one of the reasons why the next release will be a major release. The changes necessary to the connection management plumbing are too invasive to do in a patch or minor release, but they're required to fix this issue.

I'm about halfway through those changes now, and my goal is to release them by the end of the month or shortly after.

aembke avatar Aug 15 '22 16:08 aembke

Out of curiosity - would you find it useful to override DNS resolution logic? I've been debating whether to expose an interface for doing this (similar to hyper).
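As a rough idea of what such an interface could look like, here is a hypothetical sketch. The trait `Resolve` and both implementations are made up for illustration and are not fred's API: one implementation defers to the OS resolver, the other overrides lookups with a fixed table.

```rust
use std::collections::HashMap;
use std::io;
use std::net::{IpAddr, SocketAddr, ToSocketAddrs};

/// Hypothetical pluggable resolver interface (not part of fred).
trait Resolve {
    fn resolve(&self, host: &str, port: u16) -> io::Result<Vec<SocketAddr>>;
}

/// Default behavior: defer to the OS-provided resolver.
struct OsResolver;

impl Resolve for OsResolver {
    fn resolve(&self, host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
        Ok((host, port).to_socket_addrs()?.collect())
    }
}

/// Override: answer from a fixed hostname-to-IP table, e.g. in tests.
struct StaticResolver {
    table: HashMap<String, Vec<IpAddr>>,
}

impl Resolve for StaticResolver {
    fn resolve(&self, host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
        self.table
            .get(host)
            .map(|ips| ips.iter().map(|ip| SocketAddr::new(*ip, port)).collect())
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::NotFound, format!("no entry for {}", host))
            })
    }
}

/// Connection code would depend only on the trait, so callers can swap
/// the resolver without touching the reconnect logic.
fn first_addr(resolver: &dyn Resolve, host: &str, port: u16) -> io::Result<SocketAddr> {
    resolver
        .resolve(host, port)?
        .into_iter()
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "resolver returned no addresses"))
}
```

With something along these lines, the OS resolver stays the default and only users with special requirements would supply their own implementation.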

aembke avatar Aug 15 '22 17:08 aembke

Thanks! LMK if there's anything else I can do to help in a fix. We have some workarounds for this, but it would definitely save us some operational headaches to have it addressed.

AFAIK we don't need anything fancy for DNS resolution logic: the OS-provided resolver is fine for us.

zombiezen avatar Aug 15 '22 18:08 zombiezen

Sounds good. This is definitely a use case I plan on supporting, and I'll keep you updated on the status of the fix.

aembke avatar Aug 15 '22 23:08 aembke

Hey @aembke, any updates on this or any assistance we can provide? Not having this is causing us some grief in production.

zombiezen avatar Sep 19 '22 18:09 zombiezen

Yeah, my apologies, getting back into it now. There are a couple of PRs folks have submitted that I think are related to this. I'll take a look at the options and go with the fastest one that addresses this.

aembke avatar Sep 21 '22 02:09 aembke

@zombiezen Curious if you have any workaround in production?

casret avatar Oct 26 '22 23:10 casret

We've reverted to using the redis-rs library with a non-clustered Redis for the time being.

zombiezen avatar Oct 27 '22 01:10 zombiezen

We're working on a new service in Rust and have chosen fred for it; the main reason is that it's the only Redis library that does async, pooling, and clustering (all at the same time). We're also running into this exact problem and looking forward to trying out the next major release. Thanks, all.

sebastianhopkins-lh avatar Dec 07 '22 11:12 sebastianhopkins-lh

Quick update to the folks on this thread - I just published 6.0.0-beta.1 to crates.io. It has an entirely new implementation of the cluster interface and the repros I had for this issue seem to work now with the new version. If you have any feedback on the new interface please let me know.

aembke avatar Dec 12 '22 21:12 aembke

I've been using fred at this commit for a few weeks: https://github.com/aembke/fred.rs/commit/36798a2e3877dc8424109649b9a566584f8f6a2d and I'm not encountering this problem anymore, maybe @zombiezen can check too?

sebastianhopkins-lh avatar Jan 25 '23 16:01 sebastianhopkins-lh

> I've been using fred at this commit for a few weeks: 36798a2 and I'm not encountering this problem anymore, maybe @zombiezen can check too?

I don't want to leave you hanging, but we've since reverted to a sharded Redis system. I don't think we will have time to verify this bug, at least not in our immediate roadmap.

Thanks for the patch though.

lytefast avatar Jan 25 '23 22:01 lytefast