redis-py Fix get_node_from_slot to handle resharding

Pull Request check-list

Please make sure to review and check all of these items:

[x] Do tests and lints pass with this change?
[x] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
[x] Is the new or changed code fully tested?
[ ] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
[ ] Is there an example added to the examples folder (if applicable)?
[x] Was the change added to CHANGES file?

NOTE: these things are not required to open a PR and can be done afterwards / while the PR is open.

Description of change

~Introduce a new Enum and optional flag value to allow reading only from replicas, if the command supports it. It appears the sync version of RedisCluster used to support this when server_type was passed in to get_node_from_slot, but that parameter isn't set anywhere.~

This PR addresses the issue of reading from replicas while resharding, which can cause index failures. It is an alternative, and more comprehensive solution to https://github.com/redis/redis-py/pull/2989, but in both sync and asyncio implementations. (https://github.com/redis/redis-py/issues/2988).

This implementation moves the logic to a shared location between the asyncio and the sync versions of the library. I have a follow on PR to introduce additional read-only modes that was initially part of this PR, but has been kept separate to hopefully increase the likelihood that this PR can get merged.

Before the change, we stored the next replica to read from in the primary_to_idx cache. After the change we store the last read replica/primary. This is important for the following example:

Server exists with a slot with 1 primary and 2 replicas.
A number of commands are executed. The last command executes on index 1, setting primary_to_idx to value 2.
Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
get_node_from_slot reads the next-node value as 2, and returns that (and sets the next-node value to 1, due to 2+1%2 = 1, skipping the primary on the next run)
We try to read the Node at slot_nodes[2], which doesn't exist, and gives us an index exception.

After, we only store the last-read, not the next-value, so the situation would unfold as follows:

Server exists with a slot with 1 primary and 2 replicas.
A number of commands are executed. The last command executes on index 1, setting primary_name_to_last_used_index to value 1.
Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
get_node_from_slot reads the last-read-node value as 1, increments, and modulos against length of 2, resulting in the value of 0 being stored in primary_name_to_last_used_index and returned
We try to read the Node at slot_nodes[0], the primary, as expected, and everything works.

Fixes https://github.com/redis/redis-py/issues/2988

Mar 13 '24 19:03 johan-seesaw

@gerzse I was wondering if there might be any bandwidth to review this PR?

May 28 '24 19:05 johan-seesaw

Hi @gerzse I've added UTs, a change test, and fixed a bug. My initial commit was intended to solicit feedback before performing those actions, but I figured I'd go ahead and complete them in hope this change might make it in.

I've structured two commits that you can see, and eventually squash into one before merging. It demonstrates the problem area here and here (notably, two separate exception types for effectively the same issue).

The actual-fix commit eliminates the cause demonstrated by those tests.

Jun 21 '24 00:06 johan-seesaw

More related issues.

https://github.com/redis/redis-py/issues/3238 https://github.com/redis/redis-py/pull/2575

Jul 16 '24 21:07 johan-seesaw

@johan-seesaw How did you fix it in the application as this is taking time to get merged?

Mar 26 '25 08:03 noorul