
[Bug]: shard leader is unavailable after upgrade

Open BUPTAnderson opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master(commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.0.1
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 32Cores/192GB
- GPU: 
- Others:

Current Behavior

I deployed the cluster on 2022-06-24, and on 2022-07-06 I upgraded it to the new version (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3) via a rolling upgrade. After the upgrade, one collection could not be queried. Error log from one querynode:

[2022/07/07 14:59:53.042 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636905] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.043 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason  err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.047 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason  err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason  err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason  err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.145 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636912] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]

Expected Behavior

After the upgrade, collections can be queried correctly.
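
For reference, a minimal pymilvus check of this expectation might look like the sketch below; the collection name, field name, and vector dimension are placeholders, not taken from the actual cluster:

```python
from pymilvus import connections, Collection

# Connect to the upgraded cluster (host/port are placeholders).
connections.connect(alias="default", host="localhost", port="19530")

# "my_collection" and "embedding" are hypothetical names for illustration.
collection = Collection("my_collection")
collection.load()

# After a rolling upgrade this search should succeed instead of failing
# with "ShardCluster ... is no available".
results = collection.search(
    data=[[0.1] * 128],  # one 128-dim query vector (dimension assumed)
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
)
print(results)
```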

Steps To Reproduce

1. Deploy an old version of the `master` branch
2. Rolling-upgrade to `master` (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)

Milvus Log

No response

Anything else?

No response

BUPTAnderson avatar Jul 08 '22 09:07 BUPTAnderson

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

jiaoew1991 avatar Jul 08 '22 10:07 jiaoew1991

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

Node-26 is the old querynode ID (this node no longer exists in the current cluster). I upgraded the querynodes first, and when all querynode upgrades were complete, I upgraded the querycoord. Does it have anything to do with the order in which I upgraded?

BUPTAnderson avatar Jul 08 '22 10:07 BUPTAnderson

/assign @jiaoew1991 /unassign

yanliang567 avatar Jul 08 '22 13:07 yanliang567

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

Node-26 is the old querynode ID (this node no longer exists in the current cluster). I upgraded the querynodes first, and when all querynode upgrades were complete, I upgraded the querycoord. Does it have anything to do with the order in which I upgraded?

@BUPTAnderson You can try the following solutions:

  1. Use the 2.1 branch, which is more stable; the master branch is currently for 2.2.
  2. Try the https://github.com/congqixia/birdwatcher tool to check the metadata on etcd.
  3. As a last resort: after the upgrade, stop the cluster, use https://github.com/congqixia/birdwatcher to kill all QueryNodes, then restart 😭

Besides these solutions, could you provide more logs for us to find out the root cause? 😊

jiaoew1991 avatar Jul 10 '22 00:07 jiaoew1991

@jiaoew1991 I have collected all logs of the cluster; download link below: https://pan.baidu.com/s/158zMtXaKEIdH9CwbPlD9oQ?pwd=52sp (extraction code: 52sp)

Besides, I found that the datanode crashed today after running for about 2 days; I have created a new issue: #18206

BUPTAnderson avatar Jul 10 '22 02:07 BUPTAnderson

@BUPTAnderson could you please provide query coord logs from just before you upgraded the query coord?

weiliu1031 avatar Jul 13 '22 11:07 weiliu1031

@BUPTAnderson could you please provide query coord logs from just before you upgraded the query coord?

Today I tried to update the cluster to the latest branch-2.1.0; the problem still exists. I tried restarting the querycoord node several times; the problem still exists. I tried to release the collection; the request timed out.
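
For context, the release attempt was roughly the following sketch (the collection name is a placeholder); the call timed out instead of returning:

```python
from pymilvus import connections, Collection

connections.connect(alias="default", host="localhost", port="19530")

# "my_collection" is a hypothetical name; the actual release request
# timed out rather than completing.
collection = Collection("my_collection")
collection.release(timeout=60)  # raises a timeout error instead of finishing
```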

BUPTAnderson avatar Jul 14 '22 05:07 BUPTAnderson

@weiliu1031 Link: https://pan.baidu.com/s/1l4fVTe3RGrrWEj1ApcOnfg?pwd=mw6j (extraction code: mw6j). `pre` contains logs from before the upgrade; `today` contains logs from after the upgrade.

BUPTAnderson avatar Jul 14 '22 05:07 BUPTAnderson

/assign @weiliu1031

jiaoew1991 avatar Jul 15 '22 09:07 jiaoew1991

After the system reboot, something went wrong in the rebalance process. When we try to rebalance for a nodeDown event, the last step, "sync segment", has a logic problem (see the sketch below):

  1. it only updates the leader of the shard that belonged to the failed node
  2. the sync segment step runs on every shard of the replica affected by the failed node

So, when there is more than one shard in a replica and each shard belongs to a different query node, after all query nodes reboot, the sync segment step (in rebalance) fails and the collection will not become ready for query.
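
A minimal sketch of that mismatch, written in Python with invented structures purely for illustration (the real logic lives in the Go querycoord code, not here):

```python
# Illustrative model of the rebalance bug described above. All names
# and data structures are made up for this sketch and do NOT match the
# actual Milvus implementation.

alive_nodes = {"node-30", "node-31"}  # querynodes after the rolling upgrade

replica = {
    # Two shards in one replica, each led by a different OLD node.
    "dml_32_v0": {"leader": "node-26"},  # node-26 down -> triggers nodeDown
    "dml_33_v1": {"leader": "node-27"},  # node-27 is also gone after the upgrade
}

def rebalance_on_node_down(replica, failed_node, new_leader):
    # Step 1: only the shard whose leader was the failed node gets a
    # new leader assigned.
    for shard in replica.values():
        if shard["leader"] == failed_node:
            shard["leader"] = new_leader

    # Step 2: "sync segment" runs on EVERY shard of the affected
    # replica, but the other shard still points at node-27, which no
    # longer exists, so the sync (and the whole rebalance) fails.
    for name, shard in replica.items():
        if shard["leader"] not in alive_nodes:
            raise RuntimeError(f"sync segment for {name} failed: "
                               f"leader {shard['leader']} is unavailable")

rebalance_on_node_down(replica, failed_node="node-26", new_leader="node-30")
# -> RuntimeError: sync segment for dml_33_v1 failed:
#    leader node-27 is unavailable
```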

weiliu1031 avatar Jul 18 '22 08:07 weiliu1031

@BUPTAnderson any chance to verify the fix on v2.1.1?

yanliang567 avatar Aug 11 '22 06:08 yanliang567