
[Bug]: Search failed with error fail to search on all shard leaders, err=fail to Search, QueryNode ID=17, reason=ShardCluster for by-dev-rootcoord-dml_9_xxxv1 replicaID xxx is no available after etcd pod kill chaos test

Open zhuwenxing opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.1.0-20220822-ab30ebb1
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.2.0.dev19
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The search failed when running verify_all_collections.py after the etcd pod kill chaos test:

14:18:47  check collection Checker__84jtsD6A
14:18:47  collection is exist
14:18:47  
14:18:47  Create collection...
14:18:47  
14:18:47  Insert 3000 vectors cost 0.1239 seconds
14:18:47  
14:18:47  Get collection entities...
14:18:47  36370
14:18:47  
14:18:47  Get collection entities cost 3.0188 seconds
14:18:47  
14:18:47  Create index...
14:18:47  
14:18:47  Create index cost 0.5102 seconds
14:18:47  
14:18:47  Get replicas number
14:18:47  
14:18:47  Replicas number is 1
14:18:47  
14:18:47  load collection...
14:18:47  
14:18:47  load collection cost 0.0042 seconds
14:18:47  
14:18:47  Search...
14:18:47  Traceback (most recent call last):
14:18:47    File "scripts/verify_all_collections.py", line 138, in <module>
14:18:47      hello_milvus(collection_name)
14:18:47    File "scripts/verify_all_collections.py", line 95, in hello_milvus
14:18:47      "int64 > 100", output_fields=["int64", "float"], timeout=TIMEOUT
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/orm/collection.py", line 718, in search
14:18:47      partition_names, output_fields, round_decimal, timeout=timeout, schema=schema_dict, **kwargs)
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 96, in handler
14:18:47      raise e
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 92, in handler
14:18:47      return func(*args, **kwargs)
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 74, in handler
14:18:47      raise e
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/decorators.py", line 48, in handler
14:18:47      return func(self, *args, **kwargs)
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 452, in search
14:18:47      return self._execute_search_requests(requests, timeout, **_kwargs)
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 416, in _execute_search_requests
14:18:47      raise pre_err
14:18:47    File "/usr/local/lib/python3.7/dist-packages/pymilvus/client/grpc_handler.py", line 407, in _execute_search_requests
14:18:47      raise MilvusException(response.status.error_code, response.status.reason)
14:18:47  pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=17, reason=ShardCluster for by-dev-rootcoord-dml_9_435482766336851969v1 replicaID 435482756135518244 is no available)>
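
For reference, the failing call in verify_all_collections.py roughly corresponds to the pymilvus usage below. This is a minimal sketch: the filter expression, output fields, and timeout argument come from the traceback above, while the connection details, vector dimension, field names, and search parameters are assumptions for illustration.

```python
# Minimal sketch of the failing search call (not the actual verify_all_collections.py).
from pymilvus import connections, Collection

TIMEOUT = 120  # assumed timeout value in seconds

# Assumed local connection details.
connections.connect(host="127.0.0.1", port="19530")

# Collection name taken from the log above; schema details are assumptions.
collection = Collection("Checker__84jtsD6A")
collection.load()

# This call raises MilvusException(code=1, "fail to search on all shard leaders", ...)
# when the shard leader's ShardCluster becomes unavailable after the etcd pod kill.
results = collection.search(
    data=[[0.0] * 128],                      # assumed 128-dim query vector
    anns_field="float_vector",               # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    expr="int64 > 100",
    output_fields=["int64", "float"],
    timeout=TIMEOUT,
)
```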

Expected Behavior

All test cases pass.

Steps To Reproduce

No response

Milvus Log

Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test/detail/chaos-test/675/pipeline

Logs: artifacts-etcd-pod-kill-675-server-logs.tar.gz, artifacts-etcd-pod-kill-675-pytest-logs.tar.gz

Anything else?

This issue is reproduced consistently.

zhuwenxing avatar Aug 23 '22 09:08 zhuwenxing

/assign @jiaoew1991 /unassign

yanliang567 avatar Aug 24 '22 00:08 yanliang567

Maybe we should wait for QueryCoordV2's code check-in; the chaos test behavior will be better after that.

jiaoew1991 avatar Aug 24 '22 02:08 jiaoew1991

Maybe we should wait for QueryCoordV2's code check-in; the chaos test behavior will be better after that.

But this happened on the 2.1 branch, so it should be fixed before the next release.

zhuwenxing avatar Aug 24 '22 02:08 zhuwenxing

/assign @aoiasd

jiaoew1991 avatar Aug 24 '22 03:08 jiaoew1991

The main reason is that before the QueryNode restart, some segments were dropped by compaction. When the node restarts, it only reloads the segments that still have binlogs (the dropped segments are not reloaded), but the segment info for the dropped segments in etcd still points to the old node. So when QueryCoord gets a query task, it finds segments assigned to that stale node and the query fails.
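
A rough illustration of that failure mode is sketched below. The data structures and function names are purely hypothetical stand-ins for the etcd segment metadata and the QueryNode reload path; Milvus's real QueryCoord/QueryNode code is more involved.

```python
# Hypothetical simulation of the stale segment-routing problem described above.

# Segment ownership as recorded in etcd before the QueryNode restart.
# seg_2 was dropped by compaction, but its etcd record still points to node 17.
etcd_segment_info = {
    "seg_1": {"node_id": 17, "dropped_by_compaction": False},
    "seg_2": {"node_id": 17, "dropped_by_compaction": True},
}

def reload_segments_from_binlogs(segment_info):
    """On restart, the node only reloads segments that still have binlogs,
    so segments dropped by compaction are not loaded again."""
    return {
        seg_id
        for seg_id, info in segment_info.items()
        if not info["dropped_by_compaction"]
    }

def route_search(segment_info, loaded_segments):
    """QueryCoord-style routing: every segment listed in etcd must be served
    by a loaded segment, otherwise the search fails as 'not available'."""
    for seg_id, info in segment_info.items():
        if seg_id not in loaded_segments:
            raise RuntimeError(
                f"ShardCluster not available: segment {seg_id} is still "
                f"assigned to node {info['node_id']} but was never reloaded"
            )
    return "search ok"

loaded = reload_segments_from_binlogs(etcd_segment_info)
route_search(etcd_segment_info, loaded)  # raises because seg_2 is stale in etcd
```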

aoiasd avatar Aug 30 '22 11:08 aoiasd

Chaos type: pod-kill
Image tag: 2.1.0-20220913-3c3ba55
Target pod: etcd

Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release/detail/chaos-test-for-release/549/pipeline

Logs: artifacts-etcd-pod-kill-549-server-logs.tar.gz, artifacts-etcd-pod-kill-549-pytest-logs.tar.gz

zhuwenxing avatar Sep 15 '22 07:09 zhuwenxing

We can retry it with QueryCoordV2.

/assign @zhuwenxing /unassign

jiaoew1991 avatar Sep 16 '22 02:09 jiaoew1991

Bad news: after QueryCoordV2 was merged into master, the pod kill chaos test almost always fails due to a load collection timeout, but etcd itself is OK.

I will open a new issue to track the load timeout problem.

zhuwenxing avatar Sep 16 '22 04:09 zhuwenxing

Chaos type: pod-kill
Image tag: 2.1.0-20220916-3c3ba55c
Target pod: etcd

Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release/detail/chaos-test-for-release/714/pipeline/

Logs: artifacts-etcd-pod-kill-714-pytest-logs.tar.gz, artifacts-etcd-pod-kill-714-server-logs.tar.gz

zhuwenxing avatar Sep 21 '22 06:09 zhuwenxing

For the 2.1 branch, this issue is reproduced consistently.

/assign @jiaoew1991

Please help take a look!

zhuwenxing avatar Sep 21 '22 06:09 zhuwenxing

/unassign

zhuwenxing avatar Sep 21 '22 06:09 zhuwenxing

It was not reproduced on the master branch, but it still happened on the 2.1 branch (2.1.0-20220930-706b8e98).

Since the QueryCoord architecture has been refactored in master and is hard to merge back into the 2.1 branch, this issue will not be fixed there; we recommend upgrading to a later release if this issue occurs.

zhuwenxing avatar Oct 11 '22 11:10 zhuwenxing

/assign @zhuwenxing /unassign

jiaoew1991 avatar Oct 11 '22 11:10 jiaoew1991