milvus
[Bug]: shard leader is unavailable after upgrade
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master(commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.0.1
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 32Cores/192GB
- GPU:
- Others:
Current Behavior
I deployed the cluster on 2022-06-24, and on 2022-07-06 I upgraded it to the new version (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3) via a rolling upgrade. After the upgrade, one collection could not be queried. One querynode's error log:
[2022/07/07 14:59:53.042 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636905] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.043 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.047 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.145 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636912] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
Expected Behavior
After the upgrade, collections can be queried correctly.
Steps To Reproduce
1. Deploy an old version of branch `master`
2. Perform a rolling upgrade to branch `master` (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)
Milvus Log
No response
Anything else?
No response
It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?
Node-26 is the old querynode's ID (this node no longer exists in the current cluster). I upgraded the querynodes first, and once all querynode upgrades were complete, I upgraded the querycoord. Does this have anything to do with the order in which I upgraded?
/assign @jiaoew1991 /unassign
@BUPTAnderson You can try the following solutions:
- Use the 2.1 branch, which is more stable; the master branch is currently for 2.2
- Use https://github.com/congqixia/birdwatcher to check the metadata on etcd
- As a last resort: after the upgrade, stop the cluster, use https://github.com/congqixia/birdwatcher to kill all QueryNodes, then restart 😭

Besides these solutions, could you provide more logs to help us find the root cause? 😊
@jiaoew1991 I have collected all logs of the cluster; download link: https://pan.baidu.com/s/158zMtXaKEIdH9CwbPlD9oQ?pwd=52sp extraction code: 52sp
Also, I found the datanode crashed today after running for about 2 days; I have created a new issue, #18206.
@BUPTAnderson could you please provide query coord logs just before you upgrade the query coord?
Today I tried to upgrade the cluster to the latest branch-2.1.0, and the problem still exists. I tried restarting the querycoord node several times; the problem still exists. I tried releasing the collection; the request times out.
@weiliu1031 Link: https://pan.baidu.com/s/1l4fVTe3RGrrWEj1ApcOnfg?pwd=mw6j extraction code: mw6j — `pre` is the logs from before the upgrade, `today` is the logs from after the upgrade.
/assign @weiliu1031
After the system reboot, something went wrong in the rebalance process. When we handle a nodeDown event, the last step, "sync segment", has a logic problem:
- the leader is updated only for the shards that belonged to the failed node
- but the sync segment step runs on every shard of the replica affected by the failed node
So when there is more than one shard in a replica, and each shard belongs to a different query node, then after all query nodes reboot, the sync segment step (in rebalance) fails, and the collection never becomes ready for query.
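The mismatch above can be sketched in simplified Go. All types and the `handleNodeDown` helper here are illustrative stand-ins, not actual Milvus code; node IDs mirror the scenario in this thread, where only one of several dead old node IDs gets its leader reassigned:

```go
package main

import "fmt"

// Shard and Replica are hypothetical, simplified models of the
// structures involved in the rebalance step described above.
type Shard struct {
	Channel string
	Leader  int64 // query node ID currently serving this shard
}

type Replica struct {
	Shards []*Shard
}

// handleNodeDown sketches the buggy logic: the leader is reassigned only
// for shards whose leader was the failed node, yet "sync segment" is then
// attempted against every shard in the affected replica. Shards led by
// other nodes that also went away still point at stale node IDs, so the
// sync fails and the collection never becomes queryable.
func handleNodeDown(r *Replica, failedNode, newNode int64, alive map[int64]bool) []string {
	for _, s := range r.Shards {
		if s.Leader == failedNode {
			s.Leader = newNode // only the failed node's shards get a new leader
		}
	}
	var failures []string
	for _, s := range r.Shards { // sync segment runs on every shard of the replica
		if !alive[s.Leader] {
			failures = append(failures,
				fmt.Sprintf("shard %s: leader %d unavailable", s.Channel, s.Leader))
		}
	}
	return failures
}

func main() {
	// Two shards led by two different old query nodes; both old nodes are
	// gone after the rolling upgrade, but only node 26's down event is
	// being handled here.
	replica := &Replica{Shards: []*Shard{
		{Channel: "dml_32_...v0", Leader: 26},
		{Channel: "dml_33_...v1", Leader: 27},
	}}
	alive := map[int64]bool{30: true, 31: true} // new node IDs after upgrade
	for _, f := range handleNodeDown(replica, 26, 30, alive) {
		fmt.Println(f)
	}
}
```

Shard `dml_32_...v0` is reassigned to a live node, but `dml_33_...v1` keeps its stale leader 27, so its sync fails, matching the "replicaID ... is no available" errors in the logs above.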
@BUPTAnderson any chance to verify the fix on v2.1.1?