
[Bug]: Search failed with error `Search 2 failed, reason query shard(channel) by-dev-rootcoord-dml_3_437739255564537236v1 does not exist` after pulsar pod failure chaos test

Open zhuwenxing opened this issue 2 years ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20221130-67390d20
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.3.0.dev15
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - INFO - ci_test]: [test][2022-11-30T21:20:19Z] [0.00206491s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - INFO - ci_test]: assert flush: 2.0231664180755615, entities: 9000 (test_data_persistence.py:45)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - INFO - ci_test]: index info: [{'collection': 'Hello_Milvus', 'field': 'float_vector', 'index_name': 'test_HLbXFvT4', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Hello_Milvus', 'field': 'varchar', 'index_name': 'test_rJL6bPkC', 'index_param': {'index_type': 'Trie'}}] (test_data_persistence.py:64)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - INFO - ci_test]: [test][2022-11-30T21:20:19Z] [0.00523320s] Hello_Milvus load -> None (wrapper.py:30)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.04650967923462504, 0.09359632712206453, 0.09375678212618015, 0.05750658831758217, 0.12979588567542033, 0.08243681298128856, 0.022396676907106085, 0.07546737456769882, 0.10166835352461175, 0.0890813797380326, 0.13020253195002898, 0.0245454026352224, 0.11237346686014102, 0.015401923391799665, 0.1......, kwargs: {} (api_request.py:56)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_3_437739255564537236v1  does not exist

[2022-11-30T21:20:20.057Z]  err %!w(<nil>))>, <Time:{'RPC start': '2022-11-30 21:20:19.559414', 'RPC error': '2022-11-30 21:20:19.783251'}> (decorators.py:108)

[2022-11-30T21:20:20.057Z] [2022-11-30 21:20:19 - ERROR - ci_test]: Traceback (most recent call last):

[2022-11-30T21:20:20.057Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-11-30T21:20:20.057Z]     res = func(*args, **_kwargs)

[2022-11-30T21:20:20.057Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-11-30T21:20:20.057Z]     return func(*arg, **kwargs)

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search

[2022-11-30T21:20:20.057Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-11-30T21:20:20.057Z]     raise e

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-11-30T21:20:20.057Z]     return func(*args, **kwargs)

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-11-30T21:20:20.057Z]     ret = func(self, *args, **kwargs)

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-11-30T21:20:20.057Z]     raise e

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-11-30T21:20:20.057Z]     return func(self, *args, **kwargs)

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search

[2022-11-30T21:20:20.057Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests

[2022-11-30T21:20:20.057Z]     raise pre_err

[2022-11-30T21:20:20.057Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 429, in _execute_search_requests

[2022-11-30T21:20:20.057Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-11-30T21:20:20.057Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_3_437739255564537236v1  does not exist

[2022-11-30T21:20:20.058Z]  err %!w(<nil>))>

[2022-11-30T21:20:20.058Z]  (api_request.py:39)

[2022-11-30T21:20:20.058Z] [2022-11-30 21:20:19 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_3_437739255564537236v1  does not exist

[2022-11-30T21:20:20.058Z]  err %!w(<nil>))> (api_request.py:40)

Expected Behavior

all test cases passed

Steps To Reproduce

No response

Milvus Log

chaos type: pod-failure
image tag: master-20221130-67390d20
target pod: pulsar
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/247/pipeline
log:

artifacts-pulsar-pod-failure-247-server-logs.tar.gz

artifacts-pulsar-pod-failure-247-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing avatar Dec 01 '22 02:12 zhuwenxing

/assign @jiaoew1991 /unassign

yanliang567 avatar Dec 01 '22 12:12 yanliang567

/assign @aoiasd /unassign

jiaoew1991 avatar Dec 05 '22 06:12 jiaoew1991

This problem has been bothering me, and it has not been resolved yet. The problem comes from issue 21324

smallcai03 avatar Dec 22 '22 06:12 smallcai03

It was reproduced in 2.2.0-20230116-3a5f38b1

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - INFO - ci_test]: [test][2023-01-16T23:14:52Z] [0.00717665s] SearchChecker__DGgJKDXD load -> None (wrapper.py:30)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.004736048288464679, 0.009620340391613972, 0.08556947487657732, 0.1232704381627744, 0.0474209602725107, 0.03195405985025566, 0.09706087773977706, 0.14389298676275802, 0.13296566682522157, 0.11703348228419408, 0.10078517190687802, 0.11420135602802678, 0.02528739878783225, 0.028994504250801856, 0......., kwargs: {} (api_request.py:56)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=20, reason=Search 21 failed, reason query shard(channel)  by-dev-rootcoord-dml_23_438805441546227711v1  does not exist

[2023-01-16T23:15:12.029Z]  err %!w(<nil>))>, <Time:{'RPC start': '2023-01-16 23:14:52.714673', 'RPC error': '2023-01-16 23:14:52.980790'}> (decorators.py:108)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - ERROR - ci_test]: Traceback (most recent call last):

[2023-01-16T23:15:12.029Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-01-16T23:15:12.029Z]     res = func(*args, **_kwargs)

[2023-01-16T23:15:12.029Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-01-16T23:15:12.029Z]     return func(*arg, **kwargs)

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 609, in search

[2023-01-16T23:15:12.029Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-01-16T23:15:12.029Z]     raise e

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-01-16T23:15:12.029Z]     return func(*args, **kwargs)

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-01-16T23:15:12.029Z]     ret = func(self, *args, **kwargs)

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-01-16T23:15:12.029Z]     raise e

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-01-16T23:15:12.029Z]     return func(self, *args, **kwargs)

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 470, in search

[2023-01-16T23:15:12.029Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 439, in _execute_search_requests

[2023-01-16T23:15:12.029Z]     raise pre_err

[2023-01-16T23:15:12.029Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 430, in _execute_search_requests

[2023-01-16T23:15:12.029Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-01-16T23:15:12.029Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=20, reason=Search 21 failed, reason query shard(channel)  by-dev-rootcoord-dml_23_438805441546227711v1  does not exist

[2023-01-16T23:15:12.029Z]  err %!w(<nil>))>

[2023-01-16T23:15:12.029Z]  (api_request.py:39)

[2023-01-16T23:15:12.029Z] [2023-01-16 23:14:52 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=20, reason=Search 21 failed, reason query shard(channel)  by-dev-rootcoord-dml_23_438805441546227711v1  does not exist

[2023-01-16T23:15:12.029Z]  err %!w(<nil>))> (api_request.py:40)

[2023-01-16T23:15:12.029Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-01-16T23:15:12.029Z] =========================== short test summary info ============================

[2023-01-16T23:15:12.029Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__1ecIfg9u] - AssertionError

[2023-01-16T23:15:12.029Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__DGgJKDXD] - AssertionError

[2023-01-16T23:15:12.029Z] =================== 2 failed, 10 passed in 77.02s (0:01:17) ====================

script returned exit code 1

chaos type: pod-kill
image tag: 2.2.0-20230116-3a5f38b1
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/1282/pipeline
log:

artifacts-querynode-pod-kill-1282-server-logs.tar.gz

artifacts-querynode-pod-kill-1282-pytest-logs.tar.gz

zhuwenxing avatar Jan 17 '23 02:01 zhuwenxing

@aoiasd

Please take a look

zhuwenxing avatar Jan 17 '23 02:01 zhuwenxing

@aoiasd

Please take a look

OK

aoiasd avatar Jan 17 '23 02:01 aoiasd

First error: a querynode restarted because of the pulsar failure and fetched a search task immediately, before watchDeltaChannel, so it could not get the shard.

Second error: related to https://github.com/milvus-io/milvus/issues/21357. The collection has two vchannels, v0 and v1. One node holds the shard leader of v0 and a shard of v1. querycoord wants to unsubscribe v0 only, but unsubChannelTask unsubscribes both.
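The second error can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyQueryNode`, `unsub_collection`, `unsub_channel`, and the channel names are hypothetical and are not Milvus code, and the sketch does not claim to show how the actual fix works. It only shows why dropping every channel of a collection, when only v0 was requested, leaves a later search on v1 failing with "does not exist".

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueryNode:
    """Hypothetical stand-in for a query node; not Milvus code."""

    # Maps vchannel name -> role held on this node ("leader" or "shard").
    channels: dict = field(default_factory=dict)

    def unsub_collection(self, prefix: str) -> None:
        # Behaviour described in the comment above: every channel of the
        # collection is dropped, even though only one was requested.
        for name in [c for c in self.channels if c.startswith(prefix)]:
            del self.channels[name]

    def unsub_channel(self, channel: str) -> None:
        # Narrower behaviour: drop only the requested channel.
        self.channels.pop(channel, None)

    def search(self, channel: str) -> str:
        if channel not in self.channels:
            raise LookupError(f"query shard(channel) {channel} does not exist")
        return "ok"


def make_node() -> ToyQueryNode:
    # One node holds the shard leader of v0 and a shard of v1.
    return ToyQueryNode({"coll_v0": "leader", "coll_v1": "shard"})


# Over-broad unsubscribe: v1 disappears along with v0.
broad = make_node()
broad.unsub_collection("coll_")
try:
    broad.search("coll_v1")
    broad_result = "ok"
except LookupError as exc:
    broad_result = str(exc)

# Targeted unsubscribe: v1 stays searchable.
narrow = make_node()
narrow.unsub_channel("coll_v0")
narrow_result = narrow.search("coll_v1")

print(broad_result)
print(narrow_result)
```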

aoiasd avatar Jan 17 '23 07:01 aoiasd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Mar 17 '23 04:03 stale[bot]

Haha, got it~~

smallcai03 avatar Mar 17 '23 04:03 smallcai03

Fixed at https://github.com/milvus-io/milvus/pull/21794

aoiasd avatar May 24 '23 09:05 aoiasd

Haha, got it~~

smallcai03 avatar May 24 '23 09:05 smallcai03

/unassign @aoiasd
/assign @zhuwenxing
Please verify it.

jiaoew1991 avatar May 25 '23 03:05 jiaoew1991

Haha, got it~~

smallcai03 avatar Jun 26 '23 01:06 smallcai03