Duplicate data was generated during expansion.
### Search before asking

- [x] I had searched in the issues and found no similar issues.

### Version

2.10
### Minimal reproduce step
Env: a Kvrocks cluster with 4 shards.
- Use go-redis and open multiple goroutines to continuously write data concurrently:

```go
userId := helper.GetRandomInt64InRange(1, 10000)
key := strconv.FormatInt(userId, 10)
uuidStr := uuid.New().String()
timestamp := time.Now().UnixNano()
value := fmt.Sprintf("%s-%d", uuidStr, timestamp)
_, err := c.Do(context.TODO(), 2, "LPUSH", key, value)
```
- Expand the Kvrocks cluster from 4 shards to 8 shards with the following commands:

```
CLUSTERX SETNODEID $NODE_ID $NODE_ID
CLUSTERX MIGRATE $slot $dst_nodeid
CLUSTERX SETSLOT $slot NODE $node_id $new_version
```
- Check the data after the expansion.
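As background for the migration step: the slot a key hashes to can be computed client-side, which tells you which `$slot` values cover a given key. Below is a minimal sketch of the Redis Cluster slot hashing scheme (CRC-16/XMODEM mod 16384, per the Redis Cluster specification, which Kvrocks follows); the sample keys are just the stringified user IDs from the reproduce step:

```go
package main

import "fmt"

// crc16 implements CRC-16/XMODEM, the checksum Redis Cluster uses
// for key-to-slot mapping (slot = crc16(key) % 16384).
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

func slot(key string) uint16 { return crc16([]byte(key)) % 16384 }

func main() {
	// Keys in the reproduce step are stringified user IDs.
	for _, key := range []string{"1978", "3171", "9555"} {
		fmt.Printf("key %s -> slot %d\n", key, slot(key))
	}
}
```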
### What did you expect to see?

No duplicate data is generated.
### What did you see instead?
```
[root@iZwz9dff7pa0adxvbwkbc8Z ~]# ./check_redis_list
2025/01/16 13:34:20 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
LIST 1978 has duplicate data:
duplicate data: 99b22bda-96de-43ec-98d4-d74d5a590d11-1736994529584945168
duplicate data index: [43 44]
list length: 395
LIST 3171 has duplicate data:
duplicate data: 9bb7a7cc-3004-4d5b-965d-fe171badc21a-1736994509584166314
duplicate data index: [86 87]
list length: 412
LIST 3468 has duplicate data:
duplicate data: 9c3bc47b-d92c-4d55-b588-9fbc594849fc-1736994519583905106
duplicate data index: [65 66]
list length: 406
LIST 3616 has duplicate data:
duplicate data: 41342d98-ace0-4212-b9ea-437772a17e68-1736994539585064098
duplicate data index: [32 33]
list length: 376
LIST 5519 has duplicate data:
duplicate data: c9fb599e-453e-4615-bc7c-74b1b8a8bec0-1736994529583967377
duplicate data index: [51 52]
list length: 414
LIST 6445 has duplicate data:
duplicate data: 3d1c3404-af08-4d47-a0f8-2fa0cea0885a-1736994509583942704
duplicate data index: [87 88]
list length: 377
LIST 8973 has duplicate data:
duplicate data: ab937bce-f43f-4f2c-a65a-b1c6cc5f08b9-1736994529584052504
duplicate data index: [55 56]
list length: 414
LIST 9250 has duplicate data:
duplicate data: 1550b5fc-ba5a-4d88-9b9d-daf9c52650ef-1736994509583124277
duplicate data index: [82 83]
list length: 421
LIST 9555 has duplicate data:
duplicate data: 606ab92e-4052-43fb-900d-91e45892ef2a-1736994529583492941
duplicate data index: [59 60]
list length: 395
```
### Anything Else?

No response

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
@VIVALXH Great! Thanks for your detailed information, it's helpful for identifying this issue.
Hi @VIVALXH
go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12
@git-hulk thanks.
No duplicate data was generated, but this resulted in the loss of a significant amount of data that should have been written.
I controlled the osscluster retry count by setting `redis.UniversalOptions.MaxRedirects = -1` (https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L45).
However, this inevitably led to MOVED and TRYAGAIN errors, as mentioned in this PR: https://github.com/apache/kvrocks/pull/1240.
The MOVED error occurs between the following two commands:
- `CLUSTERX MIGRATE $slot $dst_nodeid`
- `CLUSTERX SETSLOT $slot NODE $node_id $new_version`
Additionally, if the interval between these two commands is longer (e.g., 10 seconds), the probability of generating duplicate data increases. The duplicates primarily appear on the new nodes after scaling out. Following the process described above, this issue can be reproduced consistently.
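To illustrate why a client retry can duplicate a write even though each individual LPUSH is applied at most once per attempt on the server, here is a pure-Go simulation (no real cluster involved; the fake shard and the error string are assumptions for illustration only): the shard applies the push, but the client's first attempt observes an error, e.g. a MOVED/TRYAGAIN arriving mid-migration, and retries.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeShard stands in for a Kvrocks node: the write is applied, but the
// first reply is "lost" (simulating a MOVED/TRYAGAIN or network error
// that the client sees after the write was already applied).
type fakeShard struct {
	list      []string
	failFirst bool
}

func (s *fakeShard) lpush(v string) error {
	s.list = append([]string{v}, s.list...) // write is applied server-side
	if s.failFirst {
		s.failFirst = false
		return errors.New("MOVED 1978 new-node:6666") // client only sees an error
	}
	return nil
}

// pushWithRetry mimics a client that retries on any error: safe for
// idempotent commands, but LPUSH is not idempotent.
func pushWithRetry(s *fakeShard, v string) {
	for s.lpush(v) != nil {
	}
}

func main() {
	s := &fakeShard{failFirst: true}
	pushWithRetry(s, "99b22bda-96de-43ec-98d4-d74d5a590d11-1736994529584945168")
	fmt.Println("list length:", len(s.list)) // the element is now stored twice
}
```

This matches the observed symptom: the duplicates sit at adjacent indexes (e.g. `[43 44]`), which is exactly what a single retried LPUSH produces.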
> I controlled the osscluster retry count by setting redis.UniversalOptions.MaxRedirects = -1.

@VIVALXH You should set `MaxRetries` to -1 instead of `MaxRedirects`; otherwise it will prevent retrying when the TRYAGAIN error is received.
@git-hulk
I am using ClusterClient, and internally it uses `MaxRedirects` rather than `MaxRetries` in its `process` function. It worked as expected and achieved our goal of preventing duplicate data: https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L967
This suggests that retries might have caused the data duplication, but in reality we cannot set the retries to zero, because doing so loses a significant amount of data.
@VIVALXH The cluster client also supports `MaxRetries`; setting `MaxRedirects=-1` means it won't retry when receiving the TRYAGAIN error.
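For readers following along, the two knobs discussed here live on go-redis's `ClusterOptions`. A sketch of the configuration (field semantics as described in this thread, not verified against every go-redis version; the addresses are hypothetical):

```go
rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs: []string{"10.0.0.1:6666", "10.0.0.2:6666"}, // hypothetical node addresses

	// Per-command retries on network errors; -1 disables them.
	MaxRetries: -1,

	// How many times the cluster client follows MOVED/ASK redirects
	// (and retries on TRYAGAIN); -1 disables this as well.
	MaxRedirects: -1,
})
```

As the thread shows, disabling these trades duplicates for lost writes, so this is a diagnostic setting rather than a fix.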
> Hi @VIVALXH
> go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12

@git-hulk Now we conclude that the retry may have caused the duplicate data, but in practice we cannot disable the retry.
> @VIVALXH The cluster client also supports `MaxRetries`; setting `MaxRedirects=-1` means it won't retry when receiving the TRYAGAIN error.
@git-hulk
Hello, I tried setting `MaxRetries=-1` today, but duplicate data still occurred. Only when I set `MaxRedirects=-1` did the duplicate data stop.
So I guess the data duplication is caused by retries due to MOVED or TRYAGAIN errors?
> So I guess the data duplication is caused by retries due to MOVED or TRYAGAIN errors?

I'm not quite sure about this. But after considering this issue: it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry once there is any network issue or timeout, or we might lose data.
> > So I guess the data duplication is caused by retries due to MOVED or TRYAGAIN errors?
>
> I'm not quite sure about this. But after considering this issue: it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry once there is any network issue or timeout, or we might lose data.

Yes, it is impossible to cancel the retry, because it would lose a significant amount of data.
At present, duplicate data only appears when scaling. Is there any other solution that can avoid it?
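One possible mitigation (my own assumption, not something Kvrocks or this thread provides): since every pushed value already embeds a unique UUID+timestamp, retry-induced duplicates can be dropped on the read side. A minimal sketch of order-preserving deduplication over a fetched list:

```go
package main

import "fmt"

// dedupe removes repeated values from a fetched list while preserving
// order; this works because each pushed value embeds a unique
// UUID+timestamp, so equal strings can only come from a retried push.
func dedupe(values []string) []string {
	seen := make(map[string]struct{}, len(values))
	out := make([]string, 0, len(values))
	for _, v := range values {
		if _, ok := seen[v]; ok {
			continue // skip the duplicate produced by a retry
		}
		seen[v] = struct{}{}
		out = append(out, v)
	}
	return out
}

func main() {
	// "b-2" appears twice at adjacent indexes, mirroring the checker output.
	fmt.Println(dedupe([]string{"a-1", "b-2", "b-2", "c-3"}))
}
```

This doesn't remove the root cause in slot migration, but it makes the writer's retries harmless to consumers.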