[Bug]: handoffTask: handoff segment failed
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: v2.1.0
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 8c64G
- GPU: None
- Others:
Current Behavior
After redeploying the new version v2.1.0 on top of the original data, there are five collections in total. Two of them can be queried, while queries against the other three time out after 60s. Checking the Docker logs shows:
milvus-querycoord | 2022-08-01T02:20:25.728765759Z [2022/08/01 02:20:25.728 +00:00] [WARN] [querycoord/handoff_handler.go:375] ["HandoffHandler: handoff task failed to execute"] [segmentID=434263383828856835] [error=", sealed segment has been exist on query node, segmentID is 434263383828856835"]
milvus-querycoord | 2022-08-01T02:20:25.729036877Z [2022/08/01 02:20:25.728 +00:00] [INFO] [querycoord/task_scheduler.go:766] ["scheduleLoop: trigger task done and delete from etcd"] [triggerTaskID=434894998182083023]
milvus-querycoord | 2022-08-01T02:20:25.729050793Z [2022/08/01 02:20:25.728 +00:00] [INFO] [querycoord/task_scheduler.go:669] ["scheduleLoop: pop a triggerTask from triggerTaskQueue"] [triggerTaskID=434894998182083024]
milvus-querycoord | 2022-08-01T02:20:25.729057513Z [2022/08/01 02:20:25.729 +00:00] [INFO] [querycoord/task_scheduler.go:497] ["begin to process task"] [taskID=434894998182083024] [task=*querycoord.handoffTask]
milvus-querycoord | 2022-08-01T02:20:25.729085229Z [2022/08/01 02:20:25.729 +00:00] [INFO] [querycoord/task.go:1742] ["start do handoff segments task"] [segmentIDs="[433153815999676417]"]
milvus-querycoord | 2022-08-01T02:20:25.730722216Z [2022/08/01 02:20:25.729 +00:00] [ERROR] [querycoord/task.go:1862] ["handoffTask: handoff segment failed"] [segmentID=433153815999676417] [error="sealed segment has been exist on query node, segmentID is 433153815999676417"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*handoffTask).execute\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:1862\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:587\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:672"]
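For context, a query that hits this 60s timeout looks roughly like the pymilvus sketch below; the collection name, vector field, and search parameters are assumptions and not taken from this report.

```python
# Hypothetical reproduction of the client-side symptom: the search call
# times out after 60 s because the query nodes never answer.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # assumed endpoint

coll = Collection("my_collection")   # hypothetical collection name
coll.load()

results = coll.search(
    data=[[0.1] * 128],              # one query vector, assumed dim=128
    anns_field="embedding",          # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    timeout=60,                      # fails here for 3 of the 5 collections
)
print(results)
```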
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
/assign @congqixia please help investigate it
Hi @p363796822, could you please provide the logs of QueryCoord, QueryNode, and Proxy? You can find scripts to export the logs at this link: https://github.com/milvus-io/milvus/tree/master/deployments/export-log
Here is the network disk link for the logs: https://pan.baidu.com/s/11yfQFW4SMYUO0fqozwPtgA (extraction code: yx3t).
I have the same problem. It appeared after upgrading from a previous Milvus-dev image from Docker Hub.
Are there any new developments?
Here are some preliminary conclusions for this problem:
Recovery is blocked by previous tasks
The log shows that the first LoadBalanceTask for NodeDown happened at 2022/07/28 08:27:47.938, while QueryCoord started at 2022/07/28 07:19:10.951:
milvus-querycoord | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task_scheduler.go:497] ["begin to process task"] [taskID=434894998181773322] [task=*querycoord.loadBalanceTask]
milvus-querycoord | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task.go:1920] ["start do loadBalanceTask"] ["trigger type"=4] [sourceNodeIDs="[692]"] [balanceReason=NodeDown] [taskID=434894998181773322]
[2022/07/28 07:19:10.951 +00:00] [INFO] [roles/roles.go:358] ["starting running Milvus components"]
The taskScheduler was handling tasks recovered from etcd from before the cluster upgrade:
milvus-querycoord | [2022/07/28 07:19:20.721 +00:00] [INFO] [querycoord/task_scheduler.go:669] ["scheduleLoop: pop a triggerTask from triggerTaskQueue"] [triggerTaskID=434891485372219518]
milvus-querycoord | [2022/07/28 07:19:20.721 +00:00] [DEBUG] [querycoord/task_scheduler.go:617] ["scheduleLoop: num of child task"] ["num child task"=15] ["trigger task ID"=434891485372219518]
...
milvus-querycoord | [2022/07/28 08:27:47.938 +00:00] [INFO] [querycoord/task_scheduler.go:766] ["scheduleLoop: trigger task done and delete from etcd"] [triggerTaskID=434891485372219518]
This long-running task finished at the very moment QueryCoord started handling the nodeDown event. In the 2.1.0 branch, the type of a recovered task is not printed in the log, but we can infer from the following line that this trigger task was a loadBalanceTask:
milvus-querycoord | [2022/07/28 08:27:47.929 +00:00] [ERROR] [querycoord/task.go:2501] ["loadBalanceTask: failed to sync segments distribution"] [collectionID=0] [replicaID=0] [error="syncReplicaSegments: can't find leader query node, leaderID = 692"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*loadBalanceTask).globalPostExecute\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:2501\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:735"]
This triggerTask was most likely created before the Milvus instance was upgraded, when a QueryNode went down and QueryCoord handled the event but failed to finish it.
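As a cross-check, how long a recovered trigger task occupied the scheduler can be measured directly from the exported QueryCoord log. Below is a rough sketch that assumes the log was saved as querycoord.log; the same approach applies to the QueryNode watchDeltaChannelsTask lines further down.

```python
# Rough log-scanning helper: print the first and last occurrence of a
# trigger task ID in the QueryCoord log and the elapsed time between them.
# Assumes the log has been exported to "querycoord.log".
import re
from datetime import datetime

TASK_ID = "434891485372219518"  # the recovered trigger task discussed above
TS_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \+00:00\]")

stamps = []
with open("querycoord.log") as f:
    for line in f:
        if TASK_ID in line:
            m = TS_RE.search(line)
            if m:
                stamps.append(datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f"))

if stamps:
    print("first seen:", stamps[0])
    print("last seen :", stamps[-1])
    print("elapsed   :", stamps[-1] - stamps[0])
```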
Slow loadBalanceTask
Such a long run time is abnormal for the system. We have met this issue before: #17900. Related PRs: #17901 #17906
So I tried to find proof in the log:
qn2.log:milvus-querynode2 | [2022/07/28 07:20:54.194 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856225019330561] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 07:20:54.195 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856224953532417] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 07:20:54.195 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=432856225110818817] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 07:20:54.196 +00:00] [INFO] [querynode/impl.go:171] ["watchDeltaChannelsTask Enqueue done"] [collectionID=433802452728283137] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 07:50:58.430 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856225019330561] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 08:09:07.565 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856224953532417] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 08:27:38.336 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=432856225110818817] [nodeID=707]
qn2.log:milvus-querynode2 | [2022/07/28 08:27:47.851 +00:00] [INFO] [querynode/impl.go:184] ["watchDeltaChannelsTask WaitToFinish done"] [collectionID=433802452728283137] [nodeID=707]
Bingo: the watchDeltaChannelsTask ran very slowly.
Checkpoint ts:
milvus-querynode2 | [2022/07/28 07:20:54.219 +00:00] [INFO] [querynode/segment_loader.go:633] ["from dml check point load delete"] [position="channel_name:\"by-dev-rootcoord-dml_3_432856225019330561v1\" msgID:\"\\010\\307\\254\\004\\020\\366\\242\\002\\030\\000 \\000\" msgGroup:\"by-dev-dataNode-467-432856225019330561\" timestamp:434420490202775553 "] ["msg id"=CMesBBD2ogIYACAA]
Parse this timestamp:
"physical time"=2022/07/07 16:30:56.108 +08:00
This IS caused by a stale checkpoint: the dml checkpoint position dates back to 2022/07/07, about three weeks before the upgrade.
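For reference, a Milvus hybrid (TSO) timestamp stores the physical time in milliseconds in its high bits and a logical counter in the low 18 bits, so the physical time parsed above can be reproduced with a few lines of Python:

```python
# Decode a Milvus hybrid (TSO) timestamp: the high bits are physical time in
# milliseconds since the Unix epoch, the low 18 bits are a logical counter.
from datetime import datetime, timezone

LOGICAL_BITS = 18

ts = 434420490202775553                    # checkpoint timestamp from the log above
physical_ms = ts >> LOGICAL_BITS
logical = ts & ((1 << LOGICAL_BITS) - 1)

print(datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc), logical)
# 2022-07-07 08:30:56.108000+00:00 1, i.e. 2022/07/07 16:30:56.108 +08:00
```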
This problem should be solved by PRs #18394 and #18497. If you want a quick workaround, please release all collections before upgrading from Milvus 2.0.2.
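For that quick workaround, releasing every collection before the upgrade can be scripted with pymilvus along these lines; the connection parameters are assumptions.

```python
# Sketch of the suggested workaround: release all collections before
# upgrading from Milvus 2.0.2 (host/port are assumptions).
from pymilvus import connections, utility, Collection

connections.connect(host="localhost", port="19530")

for name in utility.list_collections():
    try:
        Collection(name).release()
        print("released", name)
    except Exception as exc:   # collections that were never loaded may error out
        print("skipped", name, exc)
```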
Also, we observed that a recovered LoadBalanceTask may not make sense in some cases.
Does that mean that before upgrading my Milvus from 2.0.2 to 2.1.0, I should release all the collections on 2.0.2, and after the upgrade, load the collections again on 2.1.0?
Does that mean that before upgrading my Milvus from 2.0.2 to 2.1.0, I should release all the collections on 2.0.2, and after the upgrade, load them again on 2.1.0? So my plan now would be: release all the collections, roll back from 2.1.0 to 2.0.2, and then load all the collections again after upgrading from 2.0.2 to 2.1.0. Is that right?
The latest Docker image corresponding to these two PRs, milvusdb/milvus-dev:2.1.0-20220808-9899a5be, has not solved the problem, and the current state is that there is no way to connect after startup. Is there any way to restore the data by reading it from MinIO and re-inserting it? For now I am extracting the original data, then redeploying and inserting it again.
@p363796822 any chance that you could retry on a completely new deployment?
I have tried it, using the latest milvusdb/milvus-dev:2.1.0-20220808-9899a5be, but it still does not solve my problem. I have since deleted the previous data and redeployed milvusdb/milvus:v2.1.0, and found another problem: when importing data in batches, memory usage explodes.
I'm seeing this while creating an IVF_SQ8 index with the milvusdb/milvus:master-20220906-d6b0fdc image. Milvus uses no CPU, so I'm pretty sure it's not making any actual progress. I'll see if I can make a repro.
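A repro would be roughly an index build along these lines; the collection name, vector field, and index parameters are assumptions.

```python
# Hypothetical repro sketch for the hanging IVF_SQ8 index build.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # assumed endpoint

coll = Collection("my_collection")       # hypothetical collection
coll.create_index(
    field_name="embedding",              # hypothetical vector field
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 1024},
    },
)
# With the affected image this call appears to make no progress (no CPU usage).
```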
@congqixia is this the same issue (reinstall and upgrade) that you are looking into?
@p363796822 Milvus 2.1.2 has been released. You can try this new version to check whether your problem is solved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.