[Bug]: Duplicate data when syncing data from Milvus upstream to downstream
What version of Milvus is used? What version of CDC is used? Is there highly concurrent insert/delete?
@SimFG I use Milvus version 2.4.13 and CDC version v2.0.0-rc2. TPS is 2. My flow is:
- on the source Milvus, create collectionA, insert data, create an index, and flush
- back up collectionA, and restore collectionA on the target Milvus while inserting/deleting data on the source Milvus
- set the target Milvus's ttMsgEnabled to false
- create a CDC task using collectionA's backup positions (see the sketch after this list)
- stop inserting/deleting data on the source Milvus
- set the target Milvus's ttMsgEnabled back to true
- check the data in the Attu GUI
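For reference, a minimal sketch of the task-creation step, assuming the CDC server listens on port 8444 and accepts the create request shape shown in the milvus-cdc README; the host names, the channel name, and the exact layout of the positions field are illustrative assumptions, so check the schema for your CDC version:

```python
import json

import requests  # third-party; pip install requests

# All names below are placeholders: the CDC server address, the downstream
# connect params, and the "positions" layout must match your deployment
# and CDC version.
CDC_SERVER = "http://localhost:8444/cdc"

payload = {
    "request_type": "create",
    "request_data": {
        "milvus_connect_param": {
            "host": "target-milvus-host",  # downstream Milvus
            "port": 19530,
            "connect_timeout": 10,
        },
        "collection_infos": [{"name": "collectionA"}],
        # channel name -> base64 position taken from the backup; omitting
        # this field starts replication from the earliest position instead.
        "positions": {"by-dev-rootcoord-dml_0": "<base64 position from backup>"},
    },
}

resp = requests.post(CDC_SERVER, data=json.dumps(payload))
resp.raise_for_status()
print(resp.json())
```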
I just tested again with inserts only, no deletes, and I see: total rows in the source Milvus: 987; total rows in the checkpoint backup: 492; total rows in the target Milvus: 987 + 492 = 1479.
@anhnch30820 you can try using the CDC server from the latest main branch.
@SimFG I tried using the CDC server from the latest main branch and I got an error when I created the task:
[INFO] [reader/etcd_op.go:566] ["get all collection data"] [count=2]
[INFO] [reader/replicate_channel_manager.go:162] ["has added dropped collection"] [ids="[]"]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmm] [collection_id=453395456371199765]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmb] [collection_id=453395456372200058]
[2024/10/22 04:54:20.822 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc000841dc0/milvus-etcd:2379] [attempt=0]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:710] ["get all partition data"] [partition_num=2]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/collection_reader.go:319] ["has started to read collection and partition"] [task_id=1af9cdba993148c69a6162f49040642b]
[2024/10/22 04:54:20.824 +00:00] [INFO] [server/cdc_impl.go:332] ["create request done"]
It seems that the create request was processed correctly.
@SimFG But when I created a collection, nothing changed in the target cluster:
[2024/10/22 07:06:12.594 +00:00] [INFO] [reader/etcd_op.go:251] ["the collection state is not created"] [key=by-dev/meta/root-coord/database/collection-info/1/453395456372882628] [collection_name=vdsmb] [state=CollectionCreating]
[2024/10/22 07:06:13.680 +00:00] [INFO] [reader/etcd_op.go:389] ["partition state is not created or partition name is default"] [collection_id=453395456372882628] ["partition name"=_default] [state=PartitionCreated]
[2024/10/22 07:06:15.941 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc0009e8700/milvus-etcd:2379] [attempt=0]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:117] ["has watched to read collection"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:120] ["the collection should not be read"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/etcd_op.go:284] ["the collection is not consumed"] [collection_id=453395456372882628] [collection_name=vdsmb]
From the log, the collection in the source Milvus has not been created yet, because its state is still CollectionCreating. However, I suspect this problem is caused by residual data from previous runs. To ensure correctness, I suggest cleaning up all environment data first, such as the CDC meta storage information, and then redeploying the two Milvus clusters and the CDC service.
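A minimal cleanup sketch for the CDC meta storage, assuming the meta is kept in etcd under a rootPath of "cdc" as in a default milvus-cdc config; verify the endpoint and prefix against your cdc.yaml before deleting anything:

```python
import etcd3  # third-party; pip install etcd3

# Endpoint and prefix are assumptions taken from a default deployment;
# deleting the wrong prefix is destructive, so double-check both.
client = etcd3.client(host="milvus-etcd", port=2379)

# Drop all CDC task/position metadata so a fresh task starts clean.
resp = client.delete_prefix("cdc")
print("deleted keys:", resp.deleted)
```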
@SimFG I tried again, and the data is still duplicated.
How do you do it? Is it the following steps: insert data first, then delete data, and then use Attu to check the number of rows? Do you wait for a while before checking the row count? The deleted data may not have been applied yet. If you don't want to wait, you can try using flush.
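For example, a small check with pymilvus (host, port, and collection name are placeholders): flush first so pending inserts/deletes are persisted, then count with count(*), which, unlike num_entities, reflects deletions:

```python
from pymilvus import Collection, connections

# Placeholders: adjust host, port, and collection name to your cluster.
connections.connect(host="downstream-milvus-host", port="19530")

coll = Collection("collectionA")
coll.flush()  # persist pending inserts/deletes before counting
coll.load()   # count(*) runs as a query, so the collection must be loaded

res = coll.query(expr="", output_fields=["count(*)"])
print("row count:", res[0]["count(*)"])
```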
Can you find the diff data and check whether some delete operations have not taken effect?
Every PR is guarded by integration tests, and the CDC process is tested every day. In theory, such a small amount of data should be unlikely to go wrong.
@SimFG Here is the upstream
And here is the downstream
@SimFG Could you provide me with the latest milvus-cdc binary?
You can clone the repo and, in the repo directory, run: make build
Can you confirm whether the two Milvus clusters are completely independent? The downstream Milvus seems abnormal: the extra data looks like one segment's data being counted again on another segment.
318 = 169 + 149
@SimFG 318 rows are from milvus-cdc and 169 from the milvus-backup restore. It seems CDC gets all the data from the beginning rather than from the checkpoint.
@anhnch30820 Check whether the position is set correctly. You can first try creating the task without a position to see whether CDC works properly.
@SimFG That does not work with large data.
In reality most pages show only 8 rows of data, but the range from 631 to 644 should contain 14 rows in the downstream.
And I checked the total row count with code and with Attu; both give the same result in the downstream, but it should be 100999, the same as the upstream.
I created a backup and compared their total capacities; dc is the upstream, dr is the downstream. The result shows that the dr cluster has almost twice the capacity.
This test is to see whether the position parameter is being passed in when creating the task. In addition, the Attu behavior seems to be caused by duplicate data. I am currently developing a data difference checking tool.
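Until that tool lands, a rough pymilvus sketch of a PK-level diff between the two clusters (the hosts, the collection name, and the primary-key field name "pk" are placeholders): duplicate PKs on one side point at replayed inserts, while PKs present only downstream point at deletes that never took effect:

```python
from pymilvus import Collection, connections

def fetch_pks(alias: str, host: str) -> list:
    """Pull every primary key from one cluster via a query iterator."""
    connections.connect(alias=alias, host=host, port="19530")
    coll = Collection("collectionA", using=alias)
    coll.load()
    pks = []
    it = coll.query_iterator(batch_size=1000, output_fields=["pk"])
    while True:
        batch = it.next()
        if not batch:  # empty batch signals the end of the iteration
            break
        pks.extend(row["pk"] for row in batch)
    it.close()
    return pks

up = fetch_pks("up", "upstream-milvus-host")
down = fetch_pks("down", "downstream-milvus-host")

print("upstream rows:", len(up), "unique:", len(set(up)))
print("downstream rows:", len(down), "unique:", len(set(down)))
print("only in downstream:", sorted(set(down) - set(up))[:20])
```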