Lettuce Sharded PubSub Resubscribe Possible Issue
Bug Report
I'm experiencing an issue that seems similar to https://github.com/redis/lettuce/issues/2940 in that Lettuce does not seem to be resubscribing Sharded PubSub subscriptions automatically, except that I'm using Lettuce 6.5.5.RELEASE, where the referenced issue should already have been fixed.
I may be misunderstanding how things are supposed to work or have misconfigured something, so if that's the case, please let me know.
Current Behavior
Assume that you have a number of applications connected to a Redis cluster with two shards, using .connectPubSub(...).async().ssubscribe(topic).... The subscriptions are distributed across the two shards.
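For illustration, a minimal sketch of that kind of setup (the endpoint and topic are placeholders, not my actual code):

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;
import io.lettuce.core.pubsub.RedisPubSubAdapter;

public class ShardedSubscriber {

    public static void main(String[] args) {
        // Placeholder endpoint, not my actual configuration.
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create("redis://my-cluster-endpoint:6379"));

        StatefulRedisClusterPubSubConnection<String, String> connection = client.connectPubSub();

        // Simplified listener; my real implementation only overrides message(...),
        // as described in the configuration notes further down.
        connection.addListener(new RedisPubSubAdapter<String, String>() {
            @Override
            public void message(String channel, String message) {
                System.out.println(channel + ": " + message);
            }
        });

        // Sharded subscription; the cluster routes it to the shard owning the slot.
        connection.async().ssubscribe("my-topic");
    }
}
```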
Then, remove one shard (either manually, or via autoscaling policy, such as in an AWS ElastiCache deployment).
From the debug logs, it seems that when subscriptions are made with the regular non-sharded .subscribe call, once Lettuce is disconnected from the shard that is going away and reconnects to the new shard, Lettuce issues another SUBSCRIBE command. This can also be verified by connecting to the Redis cluster via the CLI and running CLIENT LIST (I can see the connection that was transferred over, and that the latest command run on that connection was subscribe) and PUBSUB CHANNELS.
However, when subscriptions are made with the sharded .ssubscribe, Lettuce is able to reconnect to the new shard, but there are no debug logs indicating that an SSUBSCRIBE command was issued. Connecting via the CLI, CLIENT LIST shows that the application did successfully reconnect to the new shard, but the latest command is cluster|myid instead of ssubscribe, and PUBSUB SHARDCHANNELS shows only the subscriptions that were originally created on that shard and none of the transferred ones.
This difference in behavior (where SUBSCRIBE reconnects successfully but SSUBSCRIBE does not) also applies to test failovers initiated via the AWS ElastiCache console, with the same outcome.
The result is that some number of Sharded PubSub messages are lost because there are no active subscribers for those messages.
Input Code
I can paste my Lettuce client configuration if that would be helpful, in case you think this might be a configuration problem on my end.
Expected behavior/code
Since SUBSCRIBE seems to automatically resubscribe on failovers and auto-scale-in for Redis clusters, I would have expected SSUBSCRIBE to also do the same.
Environment
- Lettuce version(s): 6.5.5.RELEASE
- Redis version: 7.2.4 (Valkey 8.0.1)
Additional context
As a tangentially related question, does Lettuce handle slot movement/rebalancing for SSUBSCRIBE, such as when nodes are added and slots are redistributed? I couldn't really find documentation on how slot movement works for Sharded Pub/Sub in general, and I'm not familiar enough with Lettuce and Redis to figure it out from reading the code, though I gave it a try. My main concern is whether this is something I'd need to implement myself, or whether it's handled by Lettuce.
I did some additional testing and created a simple Java console application with only the Lettuce 6.5.5.RELEASE client and Netty 4.1.119.Final. I configured six RedisClusterClient instances: two clients handling publish and spublish respectively, two subscriber clients, and two sharded subscriber clients (for a total of four topics). I was able to confirm that both when adding new nodes to the ElastiCache cluster (scaling up) and when removing nodes (scaling down), sharded subscriptions can stop receiving messages.
In both cases, the non-sharded subscriptions seemed unaffected, so the two behaviors clearly differ.
If this isn't a configuration mistake on my end, and it isn't intended behavior (perhaps I'm supposed to override the sunsubscribed method in RedisPubSubListener, or listen for a topology refresh event and manually re-establish the sharded subscription when slots move?), then it looks like an unexpected bug of some kind.
Without pasting the full code, the general options I have configured are (a rough sketch follows the list):
- ClientResources using a custom DnsAddressResolverGroup to specify things like the NioDatagramChannel type, DnsNameResolverChannelStrategy.ChannelPerResolution, retries on timeout, TCP fallback, no DNS caching, etc.
- The implementation of RedisPubSubListener only overrides message(...)
- SocketOptions with extended keepalive options enabled, idle set at 30 seconds and an interval of a few seconds after idle
- TcpUserTimeout set to 30 seconds
- ConnectTimeout set to 10 seconds
- PeriodicRefresh set to 60 seconds
- DynamicRefreshSources set to false
- EnableAllAdaptiveRefreshTriggers set to true
- CloseStaleConnections set to true
- NodeFilter enabled, filtering out NodeFlag.FAIL and NodeFlag.EVENTUAL_FAIL
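For reference, a rough sketch of how those options fit together (values mirror the list above; the custom DnsAddressResolverGroup wiring and the ClientResources setup are omitted, and clusterClient stands in for my RedisClusterClient instance):

```java
import java.time.Duration;

import io.lettuce.core.SocketOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

// Socket options: 10s connect timeout, extended keepalive, 30s TCP user timeout.
SocketOptions socketOptions = SocketOptions.builder()
        .connectTimeout(Duration.ofSeconds(10))
        .keepAlive(SocketOptions.KeepAliveOptions.builder()
                .enable()
                .idle(Duration.ofSeconds(30))
                .interval(Duration.ofSeconds(5))
                .build())
        .tcpUserTimeout(SocketOptions.TcpUserTimeoutOptions.builder()
                .enable()
                .tcpUserTimeout(Duration.ofSeconds(30))
                .build())
        .build();

// Topology refresh: periodic every 60s, static refresh sources, adaptive triggers,
// stale connections closed.
ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(60))
        .dynamicRefreshSources(false)
        .enableAllAdaptiveRefreshTriggers()
        .closeStaleConnections(true)
        .build();

// Node filter drops nodes flagged FAIL or EVENTUAL_FAIL.
ClusterClientOptions options = ClusterClientOptions.builder()
        .socketOptions(socketOptions)
        .topologyRefreshOptions(refreshOptions)
        .nodeFilter(node -> !(node.getFlags().contains(RedisClusterNode.NodeFlag.FAIL)
                || node.getFlags().contains(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)))
        .build();

clusterClient.setOptions(options); // clusterClient is the RedisClusterClient instance
```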
Thank you!
Hey @ThePeterLuu,
the team is quite busy right now, but we will get back to you as soon as possible.
@ThePeterLuu @tishun
I hit the same issue and opened a PR to fix it: https://github.com/redis/lettuce/pull/3400
The fix re-issues ssubscribe on unexpected sunsubscribe events to trigger a MOVED response, so the existing resubscribe logic kicks in.
@ThePeterLuu
Also, I worked around this problem at the application layer in a similar way: catching sunsubscribe, saving the channel in a queue, and explicitly re-subscribing when the topology refreshed.
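Roughly, the workaround looks like this (a simplified sketch; it assumes the sunsubscribed(shardChannel, count) listener callback mentioned earlier in the thread, and connection/clusterClient come from the normal setup):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import io.lettuce.core.cluster.event.ClusterTopologyChangedEvent;
import io.lettuce.core.pubsub.RedisPubSubAdapter;

// Remember shard channels that were dropped without us asking for it.
Set<String> pendingResubscribe = ConcurrentHashMap.newKeySet();

connection.addListener(new RedisPubSubAdapter<String, String>() {
    @Override
    public void sunsubscribed(String shardChannel, long count) {
        // We never asked to unsubscribe, so queue the channel for re-subscription.
        pendingResubscribe.add(shardChannel);
    }
});

// Re-issue SSUBSCRIBE once the client reports a topology change on its event bus.
clusterClient.getResources().eventBus().get()
        .filter(ClusterTopologyChangedEvent.class::isInstance)
        .subscribe(event -> {
            pendingResubscribe.forEach(channel -> connection.async().ssubscribe(channel));
            pendingResubscribe.clear();
        });
```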
Thanks for working on that @byunjuneseok
We will try to have a look at this as soon as possible
Okay folks, huge apologies for how long it took me to get to this ticket. Here is what I think.
TL;DR - we should not attempt to re-subscribe in the case of re-sharding, but instead allow the user to handle it appropriately by emitting an event.
Why?
Well, let's start from the simple fact that the client application (the one consuming the driver) would not know there was downtime between the time the server stopped servicing the channel and the time the driver established a new subscription, which could be a couple of seconds. This means it would be oblivious to the fact that messages might have been missed. Pub/Sub does not guarantee message delivery; however, some client applications might still want to do something specific during such downtimes.
Additionally, most client libraries do not handle this. Node-redis broadcasts a specific event when it detects resharding, but it does nothing more. Similar solutions exist in redis-py, jedis, predis, and go-redis.
Finally, and most importantly, I am not entirely sure re-subscribing is the correct course of action for all use cases. The re-sharding process obviously re-balances the cluster, and the subscriptions might need to be re-balanced too. Client applications know about that, and the driver should stay oblivious to this logic.
@uglide and @a-TODO-rov what's your take on the subject?