milvus-cdc [Bug]: Duplicate data when syncing data from upstream Milvus to downstream Milvus

Current Behavior

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

Oct 18 '24 09:10 anhnch30820

Which version of Milvus are you using? Which version of CDC? Is there highly concurrent insert/delete traffic?

Oct 18 '24 09:10 SimFG

@SimFG I use Milvus 2.4.13 and CDC v2.0.0-rc2. TPS is 2. My flow is:

1. On the source Milvus, create collectionA, insert data, create an index, and flush.
2. Back up collectionA and restore it on the target Milvus, while inserts/deletes continue on the source Milvus.
3. Set the target Milvus's ttMsgEnabled to false.
4. Create a CDC task using collectionA's backup positions.
5. Stop inserting/deleting data on the source Milvus.
6. Set the target Milvus's ttMsgEnabled to true.
7. Check the data in the Attu GUI.

Oct 18 '24 10:10 anhnch30820

I just tested again with inserts only (no deletes), and this is what I see:

total rows in source Milvus: 987
total rows in the checkpoint backup: 492
total rows in target Milvus: 987 + 492 = 1479
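These counts fit the pattern you would expect if CDC replayed every source row on top of the restored backup instead of resuming from the backup position. Here is a minimal sketch of that check using only the numbers above. The function name and the two hypotheses are illustrative assumptions, not part of milvus-cdc.

```python
def classify_target_count(source_rows: int, backup_rows: int, target_rows: int) -> str:
    """Compare the target row count against two simple hypotheses."""
    if target_rows == source_rows:
        # Target matches the source exactly: replication looks consistent.
        return "consistent"
    if target_rows == source_rows + backup_rows:
        # Every restored backup row appears to have been written a second time.
        return "backup rows duplicated"
    return "unexplained"


# Counts reported in this test run.
print(classify_target_count(source_rows=987, backup_rows=492, target_rows=1479))
```

If the result is "backup rows duplicated", the position passed at task creation is the first thing worth checking.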

Oct 18 '24 10:10 anhnch30820

@anhnch30820 You can try the CDC server built from the latest main branch.

Oct 18 '24 10:10 SimFG

@SimFG I tried the CDC server from the latest main branch and got the following log output when I created a task:

[INFO] [reader/etcd_op.go:566] ["get all collection data"] [count=2]
[INFO] [reader/replicate_channel_manager.go:162] ["has added dropped collection"] [ids="[]"]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmm] [collection_id=453395456371199765]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmb] [collection_id=453395456372200058]
[2024/10/22 04:54:20.822 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc000841dc0/milvus-etcd:2379] [attempt=0]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:710] ["get all partition data"] [partition_num=2]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/collection_reader.go:319] ["has started to read collection and partition"] [task_id=1af9cdba993148c69a6162f49040642b]
[2024/10/22 04:54:20.824 +00:00] [INFO] [server/cdc_impl.go:332] ["create request done"]

Oct 22 '24 04:10 anhnch30820

It seems the create request was processed correctly.

Oct 22 '24 06:10 SimFG

@SimFG But when I create a collection, nothing changes in the target cluster:

[2024/10/22 07:06:12.594 +00:00] [INFO] [reader/etcd_op.go:251] ["the collection state is not created"] [key=by-dev/meta/root-coord/database/collection-info/1/453395456372882628] [collection_name=vdsmb] [state=CollectionCreating]
[2024/10/22 07:06:13.680 +00:00] [INFO] [reader/etcd_op.go:389] ["partition state is not created or partition name is default"] [collection_id=453395456372882628] ["partition name"=_default] [state=PartitionCreated]
[2024/10/22 07:06:15.941 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc0009e8700/milvus-etcd:2379] [attempt=0]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:117] ["has watched to read collection"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:120] ["the collection should not be read"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/etcd_op.go:284] ["the collection is not consumed"] [collection_id=453395456372882628] [collection_name=vdsmb]

Oct 22 '24 07:10 anhnch30820

From the log, the collection in the source Milvus has not finished being created yet, because its state is CollectionCreating. However, I suspect this problem is caused by leftover data from previous runs. To be sure, I suggest cleaning up all environment data first, such as the CDC meta storage, and then redeploying both Milvus instances and the CDC service.
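To pull the state out of log lines like the ones above, a small parser can help. This is a rough sketch that assumes the bracketed `[key=value]` layout seen in this thread. It ignores groups with nested brackets, and the helper name is made up for illustration.

```python
import re

# Matches one bracketed group that contains no nested brackets.
FIELD = re.compile(r"\[([^\[\]]*)\]")


def parse_fields(line: str) -> dict:
    """Collect the key=value pairs from a bracketed CDC log line."""
    fields = {}
    for group in FIELD.findall(line):
        if "=" in group:
            key, value = group.split("=", 1)
            # Keys and values are sometimes quoted, e.g. ["partition name"=_default].
            fields[key.strip('"')] = value.strip('"')
    return fields


line = (
    '[2024/10/22 07:06:12.594 +00:00] [INFO] [reader/etcd_op.go:251] '
    '["the collection state is not created"] '
    '[key=by-dev/meta/root-coord/database/collection-info/1/453395456372882628] '
    '[collection_name=vdsmb] [state=CollectionCreating]'
)
fields = parse_fields(line)
print(fields["collection_name"], fields["state"])
```

Filtering on `state` makes it easy to see whether a collection ever leaves CollectionCreating in the source metadata.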

Oct 22 '24 08:10 SimFG

@SimFG I tried again, and the data is still duplicated.

Oct 23 '24 09:10 anhnch30820

How are you testing it? Are these the steps: insert data first, then delete data, then use Attu to check the number of rows? Do you wait a while before checking the row count? The deletes may not have been applied yet. If you don't want to wait, you can try calling flush.
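A rough sketch of the flush-then-count check follows. It assumes an object that behaves like a pymilvus `Collection` (with `flush()` and `query()`) and a collection that is already loaded. The helper name, host, port, and collection name are placeholders.

```python
def count_rows_after_flush(collection) -> int:
    """Flush pending writes, then count rows with a count(*) query.

    `collection` is expected to behave like pymilvus.Collection.
    """
    collection.flush()
    result = collection.query(expr="", output_fields=["count(*)"])
    return result[0]["count(*)"]


# Usage against a real cluster (connection values are placeholders):
#   from pymilvus import connections, Collection
#   connections.connect(alias="default", host="localhost", port="19530")
#   print(count_rows_after_flush(Collection("collectionA")))
```

Running the same helper against upstream and downstream gives two numbers that can be compared directly, without relying on the Attu page view.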

Can you find the data that differs, and check whether some delete operations have not taken effect?
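One way to find the differing data is to export the primary keys from both sides and compare them. This is a minimal sketch that assumes the keys are already available as plain Python lists. The function name and the sample IDs are made up for illustration.

```python
from collections import Counter


def diff_primary_keys(upstream_ids, downstream_ids):
    """Report keys that are missing, extra, or repeated downstream."""
    up = Counter(upstream_ids)
    down = Counter(downstream_ids)
    return {
        # Present upstream but never replicated.
        "missing_downstream": sorted(k for k in up if k not in down),
        # Present downstream only, e.g. a delete that was not applied.
        "extra_downstream": sorted(k for k in down if k not in up),
        # Present on both sides but more often downstream.
        "duplicated_downstream": sorted(k for k in down if k in up and down[k] > up[k]),
    }


report = diff_primary_keys([1, 2, 3, 4], [1, 2, 2, 3, 5])
print(report)
```

A non-empty `duplicated_downstream` points at replayed inserts, while a non-empty `extra_downstream` points at deletes that have not taken effect.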

Oct 23 '24 09:10 SimFG

Each PR is guaranteed by integration testing, and there will be CDC process testing every day. In theory, such a small amount of data should be unlikely to go wrong.

Oct 23 '24 09:10 SimFG

@SimFG Here is the upstream: [attachment: dc (1)]

And here is the downstream: [attachment: dr (1)]

Oct 23 '24 10:10 anhnch30820

@SimFG Could you provide the latest milvus-cdc binary?

Oct 23 '24 11:10 anhnch30820

You can clone the repo and, in the repo directory, run: make build

Oct 23 '24 11:10 SimFG

Can you confirm whether the two milvus are completely independent? I feel that the downstream milvus seems to be abnormal. The extra data seems to be the data of one segment being repeatedly calculated on another segment.

318 = 169+149

Oct 23 '24 11:10 SimFG

@SimFG 318 rows come from milvus-cdc and 169 from the milvus-backup restore. It seems CDC reads all the data from the beginning rather than from the checkpoint.

Oct 23 '24 11:10 anhnch30820

@anhnch30820 Check whether the position is set incorrectly. You can first try without a position to see whether CDC works properly.

Oct 23 '24 12:10 SimFG

@SimFG It does not work with a large amount of data.

Oct 24 '24 10:10 anhnch30820

In the downstream Attu, most pages actually show only 8 rows of data, although the results from 631 to 644 should be 14 rows. I also checked the total row count both with code and with Attu, and the two results agree in the downstream. It should be 100999, like the upstream total_entities.

Oct 24 '24 10:10 anhnch30820

I created a backup of each cluster and compared their total size (dc is upstream, dr is downstream). The result shows that the dr cluster is almost twice the size. [attachment: compare_dc_dr]

Oct 24 '24 11:10 anhnch30820

The purpose of this test is to check whether the position parameter is passed in when the task is created. In addition, Attu's behavior seems to be caused by the duplicate data. I am currently developing a tool for checking data differences.

Oct 24 '24 11:10 SimFG