Fix get_node_from_slot to handle resharding
Pull Request check-list
Please make sure to review and check all of these items:
- [x] Do tests and lints pass with this change?
- [x] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
- [x] Is the new or changed code fully tested?
- [ ] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
- [ ] Is there an example added to the examples folder (if applicable)?
- [x] Was the change added to CHANGES file?
NOTE: these things are not required to open a PR and can be done afterwards / while the PR is open.
Description of change
~Introduce a new Enum and optional flag value to allow reading only from replicas, if the command supports it. It appears the sync version of RedisCluster used to support this when server_type was passed in to get_node_from_slot, but that parameter isn't set anywhere.~
This PR addresses the issue of reading from replicas while resharding, which can cause index failures. It is an alternative, and more comprehensive solution to https://github.com/redis/redis-py/pull/2989, but in both sync and asyncio implementations. (https://github.com/redis/redis-py/issues/2988).
This implementation moves the logic to a shared location between the asyncio and the sync versions of the library. I have a follow on PR to introduce additional read-only modes that was initially part of this PR, but has been kept separate to hopefully increase the likelihood that this PR can get merged.
Before the change, we stored the next replica to read from in the primary_to_idx cache. After the change we store the last read replica/primary. This is important for the following example:
- Server exists with a slot with 1 primary and 2 replicas.
- A number of commands are executed. The last command executes on index
1, settingprimary_to_idxto value 2. - Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now
2. get_node_from_slotreads the next-node value as2, and returns that (and sets the next-node value to1, due to2+1%2 = 1, skipping the primary on the next run)- We try to read the Node at
slot_nodes[2], which doesn't exist, and gives us an index exception.
After, we only store the last-read, not the next-value, so the situation would unfold as follows:
- Server exists with a slot with 1 primary and 2 replicas.
- A number of commands are executed. The last command executes on index
1, settingprimary_name_to_last_used_indexto value 1. - Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now
2. get_node_from_slotreads the last-read-node value as1, increments, and modulos against length of 2, resulting in the value of0being stored inprimary_name_to_last_used_indexand returned- We try to read the Node at
slot_nodes[0], the primary, as expected, and everything works.
Fixes https://github.com/redis/redis-py/issues/2988
@gerzse I was wondering if there might be any bandwidth to review this PR?
Hi @gerzse I've added UTs, a change test, and fixed a bug. My initial commit was intended to solicit feedback before performing those actions, but I figured I'd go ahead and complete them in hope this change might make it in.
I've structured two commits that you can see, and eventually squash into one before merging. It demonstrates the problem area here and here (notably, two separate exception types for effectively the same issue).
The actual-fix commit eliminates the cause demonstrated by those tests.
More related issues.
https://github.com/redis/redis-py/issues/3238 https://github.com/redis/redis-py/pull/2575
@johan-seesaw How did you fix it in the application as this is taking time to get merged?