
[Bug]: shard leader is unavailable after upgrade

Open BUPTAnderson opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master(commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.0.1
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 32Cores/192GB
- GPU: 
- Others:

Current Behavior

I deployed the cluster on 2022-06-24, and on 2022-07-06 I upgraded it to the new version (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3) via a rolling upgrade. After the upgrade, one collection could not be queried. Error log from one querynode:

[2022/07/07 14:59:53.042 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636905] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.043 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.047 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason  err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.047 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason  err err: rpc error: code = Canceled desc = grpc: the client connection is closing\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636614] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [client.go:233] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n\t/opt/milvus/internal/util/grpcclient/client.go:233\ngithub.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n\t/opt/milvus/internal/distributed/querynode/client/client.go:227\ngithub.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n\t/opt/milvus/internal/querynode/shard_cluster.go:781"]
[2022/07/07 14:59:53.048 +08:00] [ERROR] [shard_cluster.go:798] ["Search 26 failed, reason  err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search\n\t/opt/milvus/internal/querynode/shard_cluster.go:798\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).searchWithDmlChannel\n\t/opt/milvus/internal/querynode/impl.go:615\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Search.func1\n\t/opt/milvus/internal/querynode/impl.go:463\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/gopath/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2022/07/07 14:59:53.048 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636566] [collectionID=434211209087287297] [error="Search 26 failed, reason  err err: rpc error: code = Canceled desc = context canceled\n, /opt/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/opt/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call\n/opt/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search\n/opt/milvus/internal/querynode/shard_cluster.go:781 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Search.func2\n/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit\n"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636028] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_32_434211209087287297v0 replicaID 434239054483320064 is no available"]
[2022/07/07 14:59:53.142 +08:00] [WARN] [task_scheduler.go:326] ["search context timeout"]
[2022/07/07 14:59:53.145 +08:00] [WARN] [impl.go:617] ["search cluster failed"] [msgID=434418948667636912] [collectionID=434211209087287297] [error="ShardCluster for by-dev-rootcoord-dml_33_434211209087287297v1 replicaID 434239054483320064 is no available"]

Expected Behavior

After the upgrade, collections can be queried correctly.
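
For reference, a minimal pymilvus check of this expectation might look like the sketch below; the collection name, field name, and vector dimension are placeholders, not taken from the actual cluster:

```python
from pymilvus import connections, Collection

# Connect to the upgraded cluster (host/port are placeholders).
connections.connect(alias="default", host="localhost", port="19530")

# "my_collection" and "embedding" are hypothetical names for illustration.
collection = Collection("my_collection")
collection.load()

# After a rolling upgrade this search should succeed instead of failing
# with "ShardCluster ... is no available".
results = collection.search(
    data=[[0.1] * 128],  # one 128-dim query vector (dimension assumed)
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
)
print(results)
```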

Steps To Reproduce

1. Deploy an old version of the `master` branch
2. Rolling-upgrade to `master` (commit: 543197c2bee8db02ae453e005fe6b37e85c242c3)

Milvus Log

No response

Anything else?

No response

BUPTAnderson avatar Jul 08 '22 09:07 BUPTAnderson

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

jiaoew1991 avatar Jul 08 '22 10:07 jiaoew1991

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

Node-26 is the old querynode ID (this node no longer exists in the current cluster). I upgraded the querynodes first, and when all querynode upgrades were complete, I upgraded the querycoord. Does it have anything to do with the order in which I upgraded?

BUPTAnderson avatar Jul 08 '22 10:07 BUPTAnderson

/assign @jiaoew1991 /unassign

yanliang567 avatar Jul 08 '22 13:07 yanliang567

It seems like the shard leader tries to access Node-26, which does not exist. @BUPTAnderson could you confirm whether Node-26 is online, or provide more logs for us?

Node-26 is the old querynode ID (this node no longer exists in the current cluster). I upgraded the querynodes first, and when all querynode upgrades were complete, I upgraded the querycoord. Does it have anything to do with the order in which I upgraded?

@BUPTAnderson You can try the following solutions:

  1. Use the 2.1 branch, which is more stable; the master branch is currently for 2.2.
  2. Try the https://github.com/congqixia/birdwatcher tool to check the metadata on etcd.
  3. As a last resort: after the upgrade, stop the cluster, use https://github.com/congqixia/birdwatcher to kill all QueryNodes, then restart 😭

Besides these solutions, could you provide more logs for us to find out the root cause? 😊

jiaoew1991 avatar Jul 10 '22 00:07 jiaoew1991

@jiaoew1991 I have collected all logs of the cluster; download link below: https://pan.baidu.com/s/158zMtXaKEIdH9CwbPlD9oQ?pwd=52sp (extraction code: 52sp)

Besides, I found that the datanode crashed today after running for about 2 days; I have created a new issue: #18206

BUPTAnderson avatar Jul 10 '22 02:07 BUPTAnderson

@BUPTAnderson could you please provide query coord logs from just before you upgraded the query coord?

weiliu1031 avatar Jul 13 '22 11:07 weiliu1031

@BUPTAnderson could you please provide query coord logs from just before you upgraded the query coord?

Today I tried to update the cluster to the latest branch-2.1.0; the problem still exists. I tried restarting the querycoord node several times; the problem still exists. I tried to release the collection; the request timed out.
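
For context, the release attempt was roughly the following sketch (the collection name is a placeholder); the call timed out instead of returning:

```python
from pymilvus import connections, Collection

connections.connect(alias="default", host="localhost", port="19530")

# "my_collection" is a hypothetical name; the actual release request
# timed out rather than completing.
collection = Collection("my_collection")
collection.release(timeout=60)  # raises a timeout error instead of finishing
```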

BUPTAnderson avatar Jul 14 '22 05:07 BUPTAnderson

@weiliu1031 Link: https://pan.baidu.com/s/1l4fVTe3RGrrWEj1ApcOnfg?pwd=mw6j (extraction code: mw6j). `pre` contains logs from before the upgrade; `today` contains logs from after the upgrade.

BUPTAnderson avatar Jul 14 '22 05:07 BUPTAnderson

/assign @weiliu1031

jiaoew1991 avatar Jul 15 '22 09:07 jiaoew1991

After the system reboot, something went wrong in the rebalance process. When we try to rebalance for a nodeDown event, the last step, "sync segment", has a logic problem (see the sketch below):

  1. it only updates the leader of the shard that belonged to the failed node
  2. the sync segment step runs on every shard of the replica affected by the failed node

So, when there is more than one shard in a replica and each shard belongs to a different query node, after all query nodes reboot, the sync segment step (in rebalance) fails and the collection will not become ready for query.
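
A minimal sketch of that mismatch, written in Python with invented structures purely for illustration (the real logic lives in the Go querycoord code, not here):

```python
# Illustrative model of the rebalance bug described above. All names
# and data structures are made up for this sketch and do NOT match the
# actual Milvus implementation.

alive_nodes = {"node-30", "node-31"}  # querynodes after the rolling upgrade

replica = {
    # Two shards in one replica, each led by a different OLD node.
    "dml_32_v0": {"leader": "node-26"},  # node-26 down -> triggers nodeDown
    "dml_33_v1": {"leader": "node-27"},  # node-27 is also gone after the upgrade
}

def rebalance_on_node_down(replica, failed_node, new_leader):
    # Step 1: only the shard whose leader was the failed node gets a
    # new leader assigned.
    for shard in replica.values():
        if shard["leader"] == failed_node:
            shard["leader"] = new_leader

    # Step 2: "sync segment" runs on EVERY shard of the affected
    # replica, but the other shard still points at node-27, which no
    # longer exists, so the sync (and the whole rebalance) fails.
    for name, shard in replica.items():
        if shard["leader"] not in alive_nodes:
            raise RuntimeError(f"sync segment for {name} failed: "
                               f"leader {shard['leader']} is unavailable")

rebalance_on_node_down(replica, failed_node="node-26", new_leader="node-30")
# -> RuntimeError: sync segment for dml_33_v1 failed:
#    leader node-27 is unavailable
```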

weiliu1031 avatar Jul 18 '22 08:07 weiliu1031

@BUPTAnderson any chance to verify the fix on v2.1.1?

yanliang567 avatar Aug 11 '22 06:08 yanliang567