
[Bug]: Search failed with error message `ShardCluster for by-dev-rootcoord-dml_1_433972442115604481v1 replicaID 433972428256051202 is no available` after querynode pod kill chaos test

Open zhuwenxing opened this issue 3 years ago • 26 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220617-7c69f4b3
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search failed when running hello_milvus.py.

2022-06-17T13:52:28.8255115Z Search...
2022-06-17T13:52:28.8313859Z Traceback (most recent call last):
2022-06-17T13:52:28.8314325Z   File "chaos/scripts/hello_milvus.py", line 116, in <module>
2022-06-17T13:52:28.8314673Z     hello_milvus(args.host)
2022-06-17T13:52:28.8315050Z   File "chaos/scripts/hello_milvus.py", line 89, in hello_milvus
2022-06-17T13:52:28.8315385Z     res = collection.search(
2022-06-17T13:52:28.8316309Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 689, in search
2022-06-17T13:52:28.8316809Z     res = conn.search(self._name, data, anns_field, param, limit, expr,
2022-06-17T13:52:28.8317473Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 61, in handler
2022-06-17T13:52:28.8317858Z     raise e
2022-06-17T13:52:28.8318415Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 44, in handler
2022-06-17T13:52:28.8318802Z     return func(self, *args, **kwargs)
2022-06-17T13:52:28.8319390Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 79, in handler
2022-06-17T13:52:28.8319764Z     raise e
2022-06-17T13:52:28.8320686Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
2022-06-17T13:52:28.8321104Z     return func(*args, **kwargs)
2022-06-17T13:52:28.8321732Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 449, in search
2022-06-17T13:52:28.8322209Z     return self._execute_search_requests(requests, timeout, **_kwargs)
2022-06-17T13:52:28.8322837Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 79, in handler
2022-06-17T13:52:28.8323194Z     raise e
2022-06-17T13:52:28.8323741Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
2022-06-17T13:52:28.8324138Z     return func(*args, **kwargs)
2022-06-17T13:52:28.8324795Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 411, in _execute_search_requests
2022-06-17T13:52:28.8325222Z     raise pre_err
2022-06-17T13:52:28.8325838Z   File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 402, in _execute_search_requests
2022-06-17T13:52:28.8326593Z     raise MilvusException(response.status.error_code, response.status.reason)
2022-06-17T13:52:28.8327509Z pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to Search, QueryNode ID=40, reason=ShardCluster for by-dev-rootcoord-dml_1_433972442115604481v1 replicaID 433972428256051202 is no available)>

Expected Behavior

All test cases pass.

Steps To Reproduce

see https://github.com/zhuwenxing/milvus/runs/6936878852?check_suite_focus=true

Milvus Log

Failed job: https://github.com/zhuwenxing/milvus/runs/6936878852?check_suite_focus=true
Log: https://github.com/zhuwenxing/milvus/suites/6979133684/artifacts/272970452

Anything else?

No response

zhuwenxing avatar Jun 17 '22 14:06 zhuwenxing

@jiaoew1991 could you please help to take a look at this issue?

/assign @jiaoew1991 /unassign

yanliang567 avatar Jun 17 '22 15:06 yanliang567

@yanliang567: GitHub didn't allow me to assign the following users: jiaoew1991.

Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

@jiaoew1991 could you please help to take a look at this issue?

/assign @jiaoew1991 /unassign

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot avatar Jun 17 '22 15:06 sre-ci-robot

Not reproduced in today's pod kill chaos test GitHub Action: https://github.com/milvus-io/milvus/runs/6941136771?check_suite_focus=true

zhuwenxing avatar Jun 18 '22 02:06 zhuwenxing

Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7068/pipeline
Log: artifacts-querynode-pod-kill-7068-server-logs.tar.gz

zhuwenxing avatar Jun 23 '22 03:06 zhuwenxing

same as https://github.com/milvus-io/milvus/issues/17203

zhuwenxing avatar Jun 23 '22 03:06 zhuwenxing

/assign @yah01 /assign @letian-jiang

Please help take a look.

zhuwenxing avatar Jun 23 '22 05:06 zhuwenxing

Version master-20220623-e8f53af7
Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7092/pipeline
Log: artifacts-querynode-pod-kill-7092-server-logs.tar.gz

zhuwenxing avatar Jun 24 '22 02:06 zhuwenxing

It seems that this issue is occurring more frequently.

Failed job: https://github.com/milvus-io/milvus/runs/7030532804?check_suite_focus=true
Log: https://github.com/milvus-io/milvus/suites/7067419316/artifacts/278853448

zhuwenxing avatar Jun 24 '22 02:06 zhuwenxing

@yah01 It still happened in version master-20220624-94a51220

Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7140/pipeline
Log: artifacts-querynode-pod-kill-7140-server-logs.tar.gz

zhuwenxing avatar Jun 24 '22 05:06 zhuwenxing

/assign @zhuwenxing please check with #17774

yah01 avatar Jun 25 '22 03:06 yah01

/unassign /assign @yah01

In version master-20220625-5471e35c:
Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7208/pipeline
Log: artifacts-querynode-pod-kill-7208-server-logs.tar.gz

zhuwenxing avatar Jun 25 '22 13:06 zhuwenxing

This issue also happened in some other cases:

- datanode pod kill (Milvus with Kafka MQ): https://github.com/zhuwenxing/milvus/runs/7062361984?check_suite_focus=true (no querynode was killed or restarted)
- cluster restart (Milvus with Kafka MQ): https://github.com/zhuwenxing/milvus/runs/7063288498?check_suite_focus=true (this issue seems to reproduce stably in this case)

zhuwenxing avatar Jun 27 '22 02:06 zhuwenxing

/assign @zhuwenxing please check with the latest master; many related fixes have been merged

yah01 avatar Jun 28 '22 02:06 yah01

Not reproduced in version master-20220628-b657e583; see https://github.com/zhuwenxing/milvus/actions/runs/2575286937

zhuwenxing avatar Jun 28 '22 13:06 zhuwenxing

It is reproduced in version master-20220630-6ab850be.
Failed job: https://github.com/milvus-io/milvus/runs/7138344034?check_suite_focus=true
Log: https://github.com/milvus-io/milvus/suites/7168240171/artifacts/285614836

zhuwenxing avatar Jul 01 '22 02:07 zhuwenxing

@yah01

please help to take a look!

zhuwenxing avatar Jul 01 '22 02:07 zhuwenxing

It also happened in the querynode pod kill test with Kafka as the MQ.

Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/234/pipeline
Log: artifacts-querynode-pod-kill-234-server-logs.tar.gz

zhuwenxing avatar Jul 01 '22 03:07 zhuwenxing

/assign @zhuwenxing should be fixed with #18002 and #18018

yah01 avatar Jul 02 '22 03:07 yah01

Not reproduced again yet; removing the critical label.

zhuwenxing avatar Jul 04 '22 08:07 zhuwenxing

Version master-20220704-1fd3ded8

Search failed when running verify_all_collections.py after the querynode pod failure chaos. The error message has changed to:

pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=channel by-dev-rootcoord-dml_17_434357994148331521v1 leader is not here)>

Failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7830/pipeline
Log: artifacts-querynode-pod-failure-7830-server-logs.tar.gz
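
Since these chaos tests hit transient unavailability while shard leaders are being reassigned, a verification script can tolerate short windows of this by retrying. A rough sketch of such a retry wrapper (not the actual verify_all_collections.py code; function and parameter names are illustrative) might be:

import time
from pymilvus.exceptions import MilvusException

def search_with_retry(collection, vectors, anns_field, param, limit, retries=5, backoff=2.0):
    """Retry a search that fails with a transient shard-leader error."""
    for attempt in range(retries):
        try:
            return collection.search(data=vectors, anns_field=anns_field, param=param, limit=limit)
        except MilvusException as e:
            # Retry only on the transient errors seen in this issue
            # ("... is no available" / "leader is not here"); re-raise anything else.
            transient = "is no available" in str(e) or "leader is not here" in str(e)
            if not transient or attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))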

zhuwenxing avatar Jul 05 '22 03:07 zhuwenxing

/assign @yah01 please help to take a look

zhuwenxing avatar Jul 05 '22 03:07 zhuwenxing

@congqixia I checked the logs: QueryCoord has correctly updated the shard leader to the new node and synced the segment distribution to the new leaders. Should the querynode return an error here to indicate that the proxy should update its cache?

yah01 avatar Jul 05 '22 03:07 yah01

/assign @zhuwenxing fixed with #18055

yah01 avatar Jul 05 '22 05:07 yah01

I tried 5 times, but it has not been reproduced in master-20220706-1c9647ff yet. I'll keep watching it. /unassign @yah01

zhuwenxing avatar Jul 06 '22 08:07 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Aug 05 '22 10:08 stale[bot]

/assign @zhuwenxing is this still an issue on the current master?

xiaofan-luan avatar Aug 10 '22 14:08 xiaofan-luan

Not reproduced anymore, so closing it.

zhuwenxing avatar Aug 15 '22 02:08 zhuwenxing