
[Bug]: handoffTask: handoff segment failed

Open p363796822 opened this issue 2 years ago • 16 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: v2.1.0
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 8c64G
- GPU: None
- Others:

Current Behavior

After redeploying with the new version v2.1.0 on top of the original data, there are five collections in total: two of them can be queried, while queries against the other three time out after 60s. Checking the docker logs shows:

milvus-querycoord | 2022-08-01T02:20:25.728765759Z [2022/08/01 02:20:25.728 +00:00] [WARN] [querycoord/handoff_handler.go:375] ["HandoffHandler: handoff task failed to execute"] [segmentID=434263383828856835] [error=", sealed segment has been exist on query node, segmentID is 434263383828856835"]
milvus-querycoord | 2022-08-01T02:20:25.729036877Z [2022/08/01 02:20:25.728 +00:00] [INFO] [querycoord/task_scheduler.go:766] ["scheduleLoop: trigger task done and delete from etcd"] [triggerTaskID=434894998182083023]
milvus-querycoord | 2022-08-01T02:20:25.729050793Z [2022/08/01 02:20:25.729 +00:00] [INFO] [querycoord/task_scheduler.go:669] ["scheduleLoop: pop a triggerTask from triggerTaskQueue"] [triggerTaskID=434894998182083024]
milvus-querycoord | 2022-08-01T02:20:25.729057513Z [2022/08/01 02:20:25.729 +00:00] [INFO] [querycoord/task_scheduler.go:497] ["begin to process task"] [taskID=434894998182083024] [task=*querycoord.handoffTask]
milvus-querycoord | 2022-08-01T02:20:25.729085229Z [2022/08/01 02:20:25.729 +00:00] [INFO] [querycoord/task.go:1742] ["start do handoff segments task"] [segmentIDs="[433153815999676417]"]
milvus-querycoord | 2022-08-01T02:20:25.730722216Z [2022/08/01 02:20:25.729 +00:00] [ERROR] [querycoord/task.go:1862] ["handoffTask: handoff segment failed"] [segmentID=433153815999676417] [error="sealed segment has been exist on query node, segmentID is 433153815999676417"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*handoffTask).execute\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:1862\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:587\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:672"]

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

p363796822 avatar Aug 01 '22 02:08 p363796822

/assign @congqixia please help with investigating this

xiaofan-luan avatar Aug 01 '22 10:08 xiaofan-luan

Hi @p363796822 , could you please provide the logs of QueryCoord, QueryNode, and Proxy? You can find scripts to export the logs at this link: https://github.com/milvus-io/milvus/tree/master/deployments/export-log

congqixia avatar Aug 01 '22 10:08 congqixia

Hi @p363796822 , could you please provide the logs of QueryCoord, QueryNode, and Proxy? You can find scripts to export the logs at this link: https://github.com/milvus-io/milvus/tree/master/deployments/export-log

Here is the network disk link to the logs: https://pan.baidu.com/s/11yfQFW4SMYUO0fqozwPtgA Extraction code: yx3t

p363796822 avatar Aug 02 '22 01:08 p363796822

I have the same problem. It appeared after upgrading from a previous Milvus-dev image on Docker Hub.

lvmnn avatar Aug 02 '22 08:08 lvmnn

Hi @p363796822 , could you please provide the logs of QueryCoord, QueryNode, and Proxy? You can find scripts to export the logs at this link: https://github.com/milvus-io/milvus/tree/master/deployments/export-log

Are there any new developments?

p363796822 avatar Aug 05 '22 03:08 p363796822

Here are some preliminary conclusions for this problem:

Recovery is blocked by previous tasks

It can be observed in the log that the first LoadBalanceTask for NodeDown happened at 2022/07/28 08:27:47.938, while QueryCoord had started at 2022/07/28 07:19:10.951:

milvus-querycoord  | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task_scheduler.go:497] ["begin to process task"] [taskID=434894998181773322] [task=*querycoord.loadBalanceTask]
milvus-querycoord  | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task.go:1920] ["start do loadBalanceTask"] ["trigger type"=4] [sourceNodeIDs="[692]"] [balanceReason=NodeDown] [taskID=434894998181773322]
[2022/07/28 07:19:10.951 +00:00] [INFO] [roles/roles.go:358] ["starting running Milvus components"]

In the meantime, the taskScheduler was handling tasks recovered from etcd that had been created before the cluster was upgraded:

milvus-querycoord  | [2022/07/28 07:19:20.721 +00:00] [INFO] [querycoord/task_scheduler.go:669] ["scheduleLoop: pop a triggerTask from triggerTaskQueue"] [triggerTaskID=434891485372219518]
milvus-querycoord  | [2022/07/28 07:19:20.721 +00:00] [DEBUG] [querycoord/task_scheduler.go:617] ["scheduleLoop: num of child task"] ["num child task"=15] ["trigger task ID"=434891485372219518]
...
milvus-querycoord  | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task_scheduler.go:766] ["scheduleLoop: trigger task done and delete from etcd"] [triggerTaskID=434891485372219518]

This long-running task finished at the very moment QueryCoord started to handle the NodeDown event.

In the 2.1.0 branch, the type of a recovered task is not printed in the log, but we can infer from the following log line that this trigger task was a loadBalanceTask:

milvus-querycoord  | [2022/07/28 08:27:47.929 +00:00] [ERROR] [querycoord/task.go:2501] ["loadBalanceTask: failed to sync segments distribution"] [collectionID=0] [replicaID=0] [error="syncReplicaSegments: can't find leader query node, leaderID = 692"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*loadBalanceTask).globalPostExecute\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:2501\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:735"]

This triggerTask was most likely created before the Milvus instance was upgraded: a QueryNode went down first, and QueryCoord handled the event but failed to finish the task. Because the trigger task queue is processed serially, that one recovered task delayed everything queued behind it, including the NodeDown balance, as illustrated by the sketch below.
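A toy sketch of this serial behavior (this is not the actual QueryCoord code; the task names and durations are made up purely for illustration):

import queue
import time

# Toy model of a serial trigger-task loop: tasks recovered from etcd are
# enqueued first, then new events such as NodeDown. One slow recovered task
# delays every task behind it.
trigger_tasks = queue.Queue()
trigger_tasks.put(("recovered loadBalanceTask", 3.0))  # pretend it runs for 3 seconds
trigger_tasks.put(("NodeDown loadBalanceTask", 0.1))

while not trigger_tasks.empty():
    name, cost = trigger_tasks.get()
    print(f"{time.strftime('%H:%M:%S')} begin {name}")
    time.sleep(cost)  # stand-in for executing the task and its child tasks
    print(f"{time.strftime('%H:%M:%S')} done  {name}")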

Slow loadBalanceTask

Such a long run time is abnormal. We have met this issue before: #17900. Related PRs: #17901, #17906.

So I looked for evidence in the logs:

qn2.log:milvus-querynode2  | [2022/07/28 07:20:54.194 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856225019330561] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 07:20:54.195 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856224953532417] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 07:20:54.195 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856225110818817] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 07:20:54.196 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=433802452728283137] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 07:50:58.430 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856225019330561] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 08:09:07.565 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856224953532417] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 08:27:38.336 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856225110818817] [nodeID=707]
qn2.log:milvus-querynode2  | [2022/07/28 08:27:47.851 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=433802452728283137] [nodeID=707]

Bingo: the watchDeltaChannelsTask ran very slowly. The tasks were enqueued at 07:20:54 but only finished between 07:50:58 and 08:27:47, roughly 30 to 67 minutes later.

Checkpoint ts:

milvus-querynode2  | [2022/07/28 07:20:54.219 +00:00] [INFO] [querynode/segment_loader.go:633] ["from dml check point load delete"] [position="channel_name:\"by-dev-rootcoord-dml_3_432856225019330561v1\" msgID:\"\\010\\307\\254\\004\\020\\366\\242\\002\\030\\000 \\000\" msgGroup:\"by-dev-dataNode-467-432856225019330561\" timestamp:434420490202775553 "] ["msg id"=CMesBBD2ogIYACAA]

Parse this timestamp:

"physical time"=2022/07/07 16:30:56.108 +08:00

This confirms the problem is caused by a checkpoint far in the past: the delete checkpoint dates back to 2022/07/07, about three weeks before the upgrade, so catching up from it made the watchDeltaChannels tasks extremely slow.

congqixia avatar Aug 05 '22 07:08 congqixia

This problem should be solved by PRs #18394 and #18497. If you want a quick workaround, please release all the collections before upgrading from Milvus 2.0.2; see the sketch below.
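A minimal pymilvus sketch of releasing every collection before the upgrade (the connection alias, host, and port are placeholders; adjust them to the actual deployment):

from pymilvus import connections, utility, Collection

# Placeholder connection settings; point these at the 2.0.2 cluster before upgrading.
connections.connect(alias="default", host="localhost", port="19530")

for name in utility.list_collections():
    Collection(name).release()  # drops the in-memory replica; data in object storage is untouched
    print(f"released collection {name}")

After the upgrade to 2.1.0, the collections can be loaded again with Collection(name).load().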

congqixia avatar Aug 05 '22 07:08 congqixia

Also, we observed that a recovered LoadBalanceTask may no longer make sense in some cases.

congqixia avatar Aug 05 '22 07:08 congqixia

This problem should be solved by PRs #18394 and #18497. If you want a quick workaround, please release all the collections before upgrading from Milvus 2.0.2.

Does that mean that before my Milvus is upgraded from 2.0.2 to 2.1.0, I should release all the collections on the 2.0.2 version, and then load the collections again on the 2.1.0 version after the upgrade?

p363796822 avatar Aug 05 '22 10:08 p363796822

This problem should be solved by PRs #18394 and #18497. If you want a quick workaround, please release all the collections before upgrading from Milvus 2.0.2.

Does that mean that before my Milvus is upgraded from 2.0.2 to 2.1.0, I should release all the collections on the 2.0.2 version, and then load the collections again on the 2.1.0 version after the upgrade? In that case I will roll back from 2.1.0 to 2.0.2 now, release all the collections there, and then load all the collections again after upgrading from 2.0.2 to 2.1.0. Is that right?

p363796822 avatar Aug 05 '22 10:08 p363796822

This problem should be solved by PRs #18394 and #18497. If you want a quick workaround, please release all the collections before upgrading from Milvus 2.0.2.

The latest Docker image corresponding to these two PRs, milvusdb/milvus-dev:2.1.0-20220808-9899a5be, has not solved the problem, and the current state is that there is no way to connect after startup. Is there any way to recover by reading the data from MinIO and re-inserting it? Otherwise I would have to extract the original data myself, redeploy, and insert it again.

p363796822 avatar Aug 09 '22 03:08 p363796822

@p363796822 any chance that you could retry on a completely new deployment?

yanliang567 avatar Aug 11 '22 06:08 yanliang567

@p363796822 any chance that you could retry on a completely new deployment?

I have tried it, using the latest milvusdb/milvus-dev:2.1.0-20220808-9899a5be, but it still does not solve my problem. I have since deleted the previous data and redeployed milvusdb/milvus:v2.1.0, and found another problem: when importing data in batches, memory usage explodes.

p363796822 avatar Aug 11 '22 06:08 p363796822

I'm seeing this while creating an IVF_SQ8 index with the milvusdb/milvus:master-20220906-d6b0fdc image. Milvus uses no CPU, so I'm pretty sure it's not making any actual progress. I'll see if I can make a repro.

mausch avatar Sep 06 '22 11:09 mausch

@congqixia is this the same issue (reinstall and upgrade) that you are looking into?

yanliang567 avatar Sep 06 '22 11:09 yanliang567

@p363796822 Milvus 2.1.2 has been released. You can try this new version to check whether your problem is solved.

JackLCL avatar Sep 21 '22 09:09 JackLCL

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Oct 31 '22 02:10 stale[bot]