lettuce icon indicating copy to clipboard operation
lettuce copied to clipboard

Undefined results when using ReadFrom with SCAN family commands

Open Dishant18 opened this issue 8 months ago • 14 comments

Bug Report

Current Behavior

We have observed that when using the ReadFrom setting with SCAN family of commands, the continuation requests (with cursor) is not guaranteed to go to the same node as the first request. Due to this behaviour, the results of SCAN family requests are undefined.

We have a master-replica Redis setup with sentinel-based discovery.

Input Code

Input Code
RedisURI redisURI = RedisURI.Builder.sentinel("sentinel-host").build();
RedisClient redisClient = RedisClient.create(redisURI);

StatefulRedisMasterReplicaConnection<byte[], byte[]> conn = MasterReplica.connect(redisClient, ByteArrayCodec.INSTANCE, redisURI);
conn.setReadFrom(ReadFrom.ANY);

byte[] key = getKey();
ScanArgs scanArgs = ScanArgs.Builder.limit(256);
ScanCursor cursor = ScanCursor.INITIAL;
MapScanCursor<byte[], byte[]> resp;
do {
            resp = conn.sync().hscan(key, cursor, scanArgs);
            cursor = ScanCursor.of(hscan.getCursor());
            // Convert and add the results to a map
} while (!resp.isFinished());

Expected behavior/code

SCAN family continuation requests need to be sticky to a Redis Node. When it is used in conjunction with ReadFrom, on every continuation request - a new node can be chosen - which seems like a bug.

Environment

  • Lettuce version(s): 6.4.0.RELEASE
  • Redis version: 7+

Additional Context

The same support is present for ClusterScanCursor. It maintains a nodeId and routes the next command to the same node. Any reason for not adding this in Master-replica setup?

Dishant18 avatar May 06 '25 03:05 Dishant18

Hey @Dishant18 , please allow for the team to catch up with some work after the holidays and we will get back to you.

tishun avatar May 07 '25 07:05 tishun

@mp911de not contributing anymore? If it's an oversight, it's a bug which you only can quickly ack and we (me/ @Dishant18 ) might be able to submit a fix. Less effort for the new team that maintains this and quicker tat for us.

We are writing hacky code at our end to get around this behaviour.

shikharid avatar May 07 '25 08:05 shikharid

@mp911de is always welcome to contribute, but I see he is quite busy with the Spring backlog.

If you and your team have a suggestion how to fix this you are more than welcome to submit a PR. I'd still have to go over it and verify your solution though, but it shouldn't take much time.

In any case, even if you do not come up with a PR, the team would get to this issue ASAP and let you know how we think we should proceed.

tishun avatar May 07 '25 11:05 tishun

@tishun gentle nudge on this one : )

We are only waiting for you to acknowledge that this is a bug, and we'll submit a PR for this soon. TIA

Dishant18 avatar May 09 '25 16:05 Dishant18

@tishun gentle nudge again : ) Any updates here will be much appreciated. TIA

Dishant18 avatar May 26 '25 11:05 Dishant18

Hey, apologies for that, I plan to spend some time for this this Friday.

tishun avatar May 26 '25 12:05 tishun

@tishun did you get a chance to look at this?

Dishant18 avatar Jun 01 '25 15:06 Dishant18

Hey, not enough, took a quick look, but I will spend more time this week. Sorry for the slowdown.

tishun avatar Jun 03 '25 06:06 tishun

Ideally I'd like to reproduce it locally so I can see what exactly happens

tishun avatar Jun 03 '25 06:06 tishun

There's sample code in the issue above which you can use to easily reproduce

  • setup a master/replica redis locally
  • add N elements to a redis hash
  • keep scanning with the above sample in a loop. If exactly N elements aren't read in all full scans, our issue is reproduced

shikharid avatar Jun 03 '25 12:06 shikharid

Reproduced the issue. The functionality to follow the node seems to be missing in many places right now. Will try to address the specific use case you've described first, as a complete solution would require a very large effort.

tishun avatar Jun 06 '25 14:06 tishun

Yeah, A special type of source aware cursor should solve it.

Similar to how it's managed in cluster connections.

shikharid avatar Jun 06 '25 14:06 shikharid

Spent some time working on this issue and unfortunately I do not currently have the bandwidth to complete it. A few pointers to why it is actually much harder to implement than initially anticipated:

  • @mp911de introduced with #201 a solution for the OSS cluster mode, that actually works in a different manner than the one expected in this issue. The ClusterScanCursor - and more specifically the whole ClusterScanSupport - works to aggregate the result from all nodes of the cluster, whereas we here need to only do make the cursor stick to one node
  • the infrastructure that exists in the cluster hierarchy is completely different than the one in the master-replica and we have no easy way to associate a cursor with a connection right now

I have a draft saved of some partial changes, but TBH Sentinel work is not high on my list of priorities right now. If the community wants to chip in I will provide all the help I can.

tishun avatar Jun 09 '25 13:06 tishun

Any reason for not adding this in Master-replica setup?

Master Replica has been introduced as additional variant to enable replicas through the Commands API instead of requiring users to maintain their own replica connection. It comes with its very own set of limitations that are necessary because of the way connections are abstracted away behind the commands API.

Taking a step back, the design is rooted in the fact that Redis used to be a connection-oriented approach. With the evolution and adding Sentinel, Cluster, and other means to run Redis, clients would need a paradigm shift to think about Redis as pool of servers and access Redis through a session interface that hides how Redis is actually operated. Design of certain features (e.g. transaction) makes it difficult to evolve in such a design. A Session-oriented view that doesn't put a Connection into its center but rather an API to run commands could help to provide certain features on top of unified infrastructure.

mp911de avatar Jun 10 '25 09:06 mp911de