[Bug]: Search failed with error message `ShardCluster for by-dev-rootcoord-dml_1_433972442115604481v1 replicaID 433972428256051202 is no available` after querynode pod kill chaos test
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20220617-7c69f4b3
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus==2.1.0.dev69
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Search failed when running hello_milvus.py:
```
Search...
Traceback (most recent call last):
  File "chaos/scripts/hello_milvus.py", line 116, in <module>
    hello_milvus(args.host)
  File "chaos/scripts/hello_milvus.py", line 89, in hello_milvus
    res = collection.search(
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 689, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 61, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 44, in handler
    return func(self, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 79, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 449, in search
    return self._execute_search_requests(requests, timeout, **_kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 79, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 411, in _execute_search_requests
    raise pre_err
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 402, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to Search, QueryNode ID=40, reason=ShardCluster for by-dev-rootcoord-dml_1_433972442115604481v1 replicaID 433972428256051202 is no available)>
```
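For context, below is a minimal sketch of the kind of search call that hits this error. The connection address, collection name, vector field name, and dimension are assumptions based on the usual hello_milvus example, not taken from the failing script itself.

```python
# Minimal sketch (assumed names/params) of the search call that raises the error above.
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")  # assumed address

collection = Collection("hello_milvus")               # assumed collection name
collection.load()

search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
vectors_to_search = [[0.0] * 8]                       # assumed 8-dim float vectors

# This call fails with "ShardCluster ... is no available" when the shard leader
# on the killed querynode has not yet been taken over by another node.
res = collection.search(
    data=vectors_to_search,
    anns_field="embeddings",                          # assumed vector field name
    param=search_params,
    limit=3,
)
print(res)
```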
Expected Behavior
All test cases pass.
Steps To Reproduce
see https://github.com/zhuwenxing/milvus/runs/6936878852?check_suite_focus=true
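The CI job drives the pod kill through its chaos tooling; as a rough manual approximation only (the namespace and pod label below are assumptions, not the actual chaos workflow), one can delete a querynode pod with the Kubernetes Python client and then rerun hello_milvus.py:

```python
# Hedged sketch of a manual reproduction: kill one querynode pod, then run the search.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "chaos-testing"                           # assumed namespace
pods = v1.list_namespaced_pod(
    namespace,
    label_selector="component=querynode",             # assumed querynode pod label
)

if pods.items:
    victim = pods.items[0].metadata.name
    v1.delete_namespaced_pod(name=victim, namespace=namespace)
    print(f"deleted querynode pod {victim}; now run hello_milvus.py against the cluster")
```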
Milvus Log
failed job: https://github.com/zhuwenxing/milvus/runs/6936878852?check_suite_focus=true
log: https://github.com/zhuwenxing/milvus/suites/6979133684/artifacts/272970452
Anything else?
No response
@jiaoew1991 could you please help to take a look at this issue?
/assign @jiaoew1991 /unassign
@yanliang567: GitHub didn't allow me to assign the following users: jiaoew1991.
Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
In response to this:
@jiaoew1991 could you please help to take a look at this issue?
/assign @jiaoew1991 /unassign
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Not reproduced in today's pod kill chaos test GitHub action, https://github.com/milvus-io/milvus/runs/6941136771?check_suite_focus=true
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7068/pipeline
log: artifacts-querynode-pod-kill-7068-server-logs.tar.gz
same as https://github.com/milvus-io/milvus/issues/17203
/assign @yah01 /assign @letian-jiang
Please help to take a look.
version master-20220623-e8f53af7
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7092/pipeline
log: artifacts-querynode-pod-kill-7092-server-logs.tar.gz
The frequency of this issue seems to have increased.
failed job: https://github.com/milvus-io/milvus/runs/7030532804?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/7067419316/artifacts/278853448
@yah01
It still happened in version master-20220624-94a51220
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7140/pipeline
log: artifacts-querynode-pod-kill-7140-server-logs.tar.gz
/assign @zhuwenxing please check with #17774
/unassign /assign @yah01
In version master-20220625-5471e35c
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7208/pipeline
log: artifacts-querynode-pod-kill-7208-server-logs.tar.gz
This issue also happened in some other chaos scenarios:
- datanode pod kill (Milvus with Kafka MQ): https://github.com/zhuwenxing/milvus/runs/7062361984?check_suite_focus=true (no querynode was killed or restarted)
- cluster restart (Milvus with Kafka MQ): https://github.com/zhuwenxing/milvus/runs/7063288498?check_suite_focus=true (this case seems to reproduce the issue stably)
/assign @zhuwenxing please check with the latest master; many related fixes have been merged
Not reproduced in version master-20220628-b657e583
see https://github.com/zhuwenxing/milvus/actions/runs/2575286937
It is reproduced in the version master-20220630-6ab850be
failed job: https://github.com/milvus-io/milvus/runs/7138344034?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/7168240171/artifacts/285614836
@yah01
please help to take a look!
It also happened in the querynode pod kill test with Kafka as the MQ.
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/234/pipeline
log: artifacts-querynode-pod-kill-234-server-logs.tar.gz
/assign @zhuwenxing should be fixed with #18002 and #18018
Not reproduced yet, remove the critical label.
Version master-20220704-1fd3ded8
Search failed when running verify_all_collections.py after querynode pod failure chaos
The error message has changed to
```
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=channel by-dev-rootcoord-dml_17_434357994148331521v1 leader is not here)>
```
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/7830/pipeline
log: artifacts-querynode-pod-failure-7830-server-logs.tar.gz
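For reference, a rough, hedged sketch of what a verification pass like verify_all_collections.py presumably does after the chaos (the actual script lives in the chaos test suite; the field name and dimension here are assumptions):

```python
# Hedged sketch of a post-chaos verification pass: search every collection and report failures.
from pymilvus import connections, utility, Collection

connections.connect(host="127.0.0.1", port="19530")   # assumed address

search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
for name in utility.list_collections():
    collection = Collection(name)
    collection.load()
    try:
        collection.search(
            data=[[0.0] * 128],                       # assumed 128-dim vectors
            anns_field="float_vector",                # assumed vector field name
            param=search_params,
            limit=1,
        )
        print(f"{name}: search ok")
    except Exception as e:                            # e.g. "leader is not here"
        print(f"{name}: search failed: {e}")
```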
/assign @yah01 please help to take a look
@congqixia I checked the logs: QueryCoord has correctly updated the shard leader to the new node and synced the segment distribution to the new leaders. Should the querynode return an error here to indicate that the proxy should update its cache?
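To make the question above concrete, here is a schematic, purely illustrative Python model of the pattern being discussed: the proxy caches shard leaders, and a "not the leader here" error from the querynode tells it to invalidate that entry and retry. The class and error names are hypothetical and do not correspond to actual Milvus internals.

```python
# Illustrative model only (not Milvus code): shard-leader cache with invalidate-and-retry.
class ShardLeaderCache:
    def __init__(self, fetch_leader):
        self._fetch_leader = fetch_leader    # callable: channel -> current leader node id
        self._leaders = {}

    def leader(self, channel):
        if channel not in self._leaders:
            self._leaders[channel] = self._fetch_leader(channel)
        return self._leaders[channel]

    def invalidate(self, channel):
        self._leaders.pop(channel, None)


class NotShardLeader(Exception):
    """Raised by a node that no longer serves the channel."""


def search_with_retry(cache, channel, do_search, retries=3):
    for _ in range(retries):
        node = cache.leader(channel)
        try:
            return do_search(node)
        except NotShardLeader:
            # Distribution changed (e.g. after a pod kill): drop the stale
            # entry so the next attempt asks for the new leader.
            cache.invalidate(channel)
    raise RuntimeError(f"no available shard leader for {channel}")
```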
/assign @zhuwenxing fixed with #18055
I tried 5 times, but it has not yet been reproduced in master-20220706-1c9647ff.
Keep watching it.
/unassign @yah01
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
/assign @zhuwenxing is this still an issue on the current master?
Not reproduced anymore, so closing it.