SCAN, SSCAN, HSCAN, and ZSCAN are not stable across failovers

Open madolson opened this issue 1 year ago • 3 comments

Scan is supposed to provide the following guarantees (as per https://redis.io/docs/manual/keyspace/):

A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.

A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration. So if an element was removed before the start of an iteration, and is never added back to the collection for all the time an iteration lasts, SCAN ensures that this element will never be returned.

While playing around with a PoC for cluster-wide scan, I realized that the first guarantee only holds as long as you stay connected to the same node for the entire duration of the scan. If a failover occurs, the replica may have a different seed value for the siphash function, so the cursor previously used on the primary no longer points to the same place in the replica's hash table. This is likely less of an issue for SCAN, since you normally indicate the node you're talking to, but many CME clients transparently handle redirects for SSCAN/HSCAN/ZSCAN during failovers.
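To make the failure mode concrete, here is a minimal sketch, not the server's actual implementation. It assumes a fixed table of 16 buckets, a cursor that simply walks bucket indexes in increasing order (the real SCAN cursor is more involved), and keyed BLAKE2b as a stand-in for the seeded siphash. The names `bucket`, `scan_range`, and both seed values are hypothetical.

```python
import hashlib

NBUCKETS = 16  # hypothetical table size, fixed for the whole scan


def bucket(key: str, seed: bytes) -> int:
    # Keyed BLAKE2b stands in for the seeded siphash (an assumption); only the
    # "a different seed gives a different bucket" property matters here.
    digest = hashlib.blake2b(key.encode(), key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "little") % NBUCKETS


def scan_range(keys, seed, start, stop):
    """Return the keys stored in buckets start..stop-1 under the given seed."""
    return {k for k in keys if start <= bucket(k, seed) < stop}


keys = {f"member:{i}" for i in range(64)}
primary_seed = b"seed-of-primary"  # hypothetical seed values
replica_seed = b"seed-of-replica"

# The client scans buckets 0..7 on the primary and is handed cursor 8.
first_half = scan_range(keys, primary_seed, 0, 8)

# Same node, same seed: resuming at cursor 8 covers every remaining key.
same_node = first_half | scan_range(keys, primary_seed, 8, NBUCKETS)

# After a failover the replica hashes with its own seed, so cursor 8 now
# points into a differently shuffled table.
after_failover = first_half | scan_range(keys, replica_seed, 8, NBUCKETS)

missed = keys - after_failover
print(f"same node: {len(same_node)} of {len(keys)} keys returned")
print(f"after failover: {len(after_failover)} returned, {len(missed)} missed")
```

With the same seed the two halves always add up to the full set. With different seeds, any key that lands in buckets 8..15 on the primary but in buckets 0..7 on the replica is never returned, which is exactly a violation of the first guarantee.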

I'm not sure we strictly need to do anything about SCAN, since I'm not sure how often a SCAN is resumed after a failover. The other commands might need a fix, though.

I think this is worth addressing, and it could be done in three ways:

  1. Simply update the documentation to add the caveat. This doesn't feel right to me, because most users will not really be aware of this.
  2. Add a configuration so that seeds can be set externally. This would allow operators to configure this consistency, but has limited benefit for those who don't know about it.
  3. Allow replicas to sync their data seed from their primaries. This makes their cursors consistent. We can also persist this into the RDB so that it is still accurate. I'm most in favor of this.

ref: https://github.com/redis/redis/issues/12440

We never came up with a cohesive answer here

madolson avatar Mar 22 '24 05:03 madolson

There is a second issue that I want to make sure we don't drop: the cursor format for SCAN is changing in Redis 8 with the new per-slot dictionaries. The format is changing from <64 bits for DB cursor> to 00<14 bits for slot><48 bits for DB cursor>. To be specific, we aren't actually showing the bits, just the integer representation of those bits. We should be able to detect versioning issues, e.g. using a cursor from a 7.0 node on an 8.0 node, so I would like to propose we update it to one of the following two options:

  1. <2 version bits><14 bits for slot><48 bits for DB cursor>, where we bump the version from 00 -> 01. This is likely the most backwards compatible. It also allows us to do a third version in the future if we want to reorganize the cursor bits. Going to more than 2 bits introduces the risk that it breaks users who are storing the cursor as a long long.
  2. -<14 bits for slot><48 bits for DB cursor>. The version will initially be 1, and we'll use the - to detect the new format. This gives us the most freedom to change the version more in the future.

@yossigo, I wasn't able to find our decision from the previous meeting. I know you were concerned with the format, which we agreed we should look into fixing, but I don't recall if you also wanted to try to make the SCAN command stable across failover. (I still want to try to make HSCAN, ZSCAN, and SSCAN stable.) The idea is that it might return duplicate items, but won't omit items because of the cursor shift.
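For illustration, the layout <2 version bits><14 bits for slot><48 bits for DB cursor> could be packed and unpacked roughly as follows. This is a sketch under the assumption that the version bits are the most significant bits of a 64-bit integer; `pack_cursor` and `unpack_cursor` are hypothetical helper names, not functions from the codebase.

```python
VERSION_BITS = 2
SLOT_BITS = 14
DB_CURSOR_BITS = 48


def pack_cursor(version: int, slot: int, db_cursor: int) -> int:
    """Combine the three fields into one integer, version in the top bits."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in 2 bits")
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot does not fit in 14 bits")
    if not 0 <= db_cursor < (1 << DB_CURSOR_BITS):
        raise ValueError("DB cursor does not fit in 48 bits")
    return (
        (version << (SLOT_BITS + DB_CURSOR_BITS))
        | (slot << DB_CURSOR_BITS)
        | db_cursor
    )


def unpack_cursor(cursor: int) -> tuple[int, int, int]:
    """Split a packed cursor back into (version, slot, db_cursor)."""
    if not 0 <= cursor < (1 << 64):
        raise ValueError("cursor does not fit in 64 bits")
    db_cursor = cursor & ((1 << DB_CURSOR_BITS) - 1)
    slot = (cursor >> DB_CURSOR_BITS) & ((1 << SLOT_BITS) - 1)
    version = cursor >> (SLOT_BITS + DB_CURSOR_BITS)
    return version, slot, db_cursor


# A version-01 cursor for the highest slot; the client only ever sees the integer.
new_cursor = pack_cursor(1, 16383, 42)
print(new_cursor, unpack_cursor(new_cursor))

# A cursor in the 00<slot><cursor> format decodes as version 0.
old_style = pack_cursor(0, 5, 7)
print(old_style, unpack_cursor(old_style))
```

In this sketch, versions 00 and 01 leave the top bit clear, so the integer still fits in a signed 64-bit long long. A server receiving a cursor could compare the decoded version against the one it expects and reject a mismatch instead of silently resuming at the wrong position.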

madolson avatar Mar 22 '24 05:03 madolson

This is a great topic to finish. So it's actually three things mentioned here:

  1. Scan across failovers. Syncing the hashtable seed. We can do that in the RDB or in another way when initiating replication or before. The seed may be security-sensitive information, but if anyone has permission to replicate, I think it's fine to let them fetch the seed too.
  2. Cluster-wide scan. Per slot, either explicitly or implicitly. You were against exposing the slot to the user, while others liked it.
  3. Cursor versioning. I don't see how you can use these bits though. Are these two LSB unused in an old cursor?

zuiderkwast avatar Mar 24 '24 00:03 zuiderkwast