milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: Search and query failed with the error `_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED` when concurrent insert/upsert/delete/search/query requests are sent

Open zhuwenxing opened this issue 1 year ago • 22 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:PR-29424-20240103-74ad0c040
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['int64 in [842, 207, 326, 238, 1924]', None, None, 10], kwargs: {} (api_request.py:62)

[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])>, <Time:{'RPC start': '2024-01-03 09:39:15.154966', 'RPC error': '2024-01-03 09:39:15.178422'}> (decorators.py:128)

[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-03T09:39:38.084Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-03T09:39:38.084Z]     res = func(*args, **_kwargs)

[2024-01-03T09:39:38.084Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-03T09:39:38.084Z]     return func(*arg, **kwargs)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 915, in query

[2024-01-03T09:39:38.084Z]     return conn.query(

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 129, in handler

[2024-01-03T09:39:38.084Z]     raise e from e

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 125, in handler

[2024-01-03T09:39:38.084Z]     return func(*args, **kwargs)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 164, in handler

[2024-01-03T09:39:38.084Z]     return func(self, *args, **kwargs)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 104, in handler

[2024-01-03T09:39:38.084Z]     raise e from e

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 68, in handler

[2024-01-03T09:39:38.084Z]     return func(*args, **kwargs)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1314, in query

[2024-01-03T09:39:38.084Z]     check_status(response.status)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 54, in check_status

[2024-01-03T09:39:38.084Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-03T09:39:38.084Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])>

[2024-01-03T09:39:38.084Z]  (api_request.py:45)

[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])> (api_request.py:46)

[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.10028257037807405, 0.11010703908291551, 0.1392804370425213, 0.1043137217595161, 0.06713751283051579, 0.012174845508086932, 0.02535379578369703, 0.011733175588496571, 0.023539246659198022, 0.12330800922598613, 0.041244465056191115, 0.08051328049227138, 0.07585501537845386, 0.061646895741733196, ......, kwargs: {} (api_request.py:62)

[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_0_446764440316416579v1])>, <Time:{'RPC start': '2024-01-03 09:39:14.625148', 'RPC error': '2024-01-03 09:39:14.628038'}> (decorators.py:128)

[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-03T09:39:38.083Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-03T09:39:38.083Z]     res = func(*args, **_kwargs)

[2024-01-03T09:39:38.083Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-03T09:39:38.083Z]     return func(*arg, **kwargs)

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 782, in search

[2024-01-03T09:39:38.083Z]     resp = conn.search(

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 129, in handler

[2024-01-03T09:39:38.083Z]     raise e from e

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 125, in handler

[2024-01-03T09:39:38.083Z]     return func(*args, **kwargs)

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 164, in handler

[2024-01-03T09:39:38.083Z]     return func(self, *args, **kwargs)

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 104, in handler

[2024-01-03T09:39:38.083Z]     raise e from e

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 68, in handler

[2024-01-03T09:39:38.083Z]     return func(*args, **kwargs)

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 748, in search

[2024-01-03T09:39:38.083Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-01-03T09:39:38.083Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 709, in _execute_search

[2024-01-03T09:39:38.083Z]     raise e from e

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 702, in _execute_search

[2024-01-03T09:39:38.084Z]     check_status(response.status)

[2024-01-03T09:39:38.084Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 54, in check_status

[2024-01-03T09:39:38.084Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-03T09:39:38.084Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_0_446764440316416579v1])>

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed pipeline: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29424/8/pipeline

Anything else?

I don't see pods restarted for the Milvus release

zhuwenxing avatar Jan 03 '24 11:01 zhuwenxing

/assign @congqixia /unassign

yanliang567 avatar Jan 03 '24 11:01 yanliang567

@congqixia I have rebased the latest commit in the master branch, and the ci still failed. failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29424/14/pipeline/26

[2024-01-05T09:14:11.333Z] [2024-01-05 09:13:55 - ERROR - pymilvus.decorators]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-05 09:13:45.928331', 'gRPC error': '2024-01-05 09:13:55.930154'}> (decorators.py:145)
[2024-01-05T09:14:11.333Z] [2024-01-05 09:13:55 - ERROR - pymilvus.decorators]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-05 09:13:45.928608', 'gRPC error': '2024-01-05 09:13:55.930525'}> (decorators.py:145)

zhuwenxing avatar Jan 05 '24 09:01 zhuwenxing

failed job:https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/2/pipeline/199


[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.027154544477948565, 0.12520229297895016, 0.05025134566586025, 0.06978392276226772, 0.035489844523688835, 0.09389808285370721, 0.02850963396626388, 0.02536066202838759, 0.14153427606519733, 0.024387341933807778, 0.015147580294618384, 0.11688336802589201, 0.029718463651159828, 0.01661791774801437......, kwargs: {} (api_request.py:62)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:06.766110', 'RPC error': '2024-01-09 12:53:06.802371'}> (decorators.py:134)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:06.771545', 'RPC error': '2024-01-09 12:53:06.807714'}> (decorators.py:134)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-09T12:53:26.031Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-09T12:53:26.031Z]     res = func(*args, **_kwargs)

[2024-01-09T12:53:26.031Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-09T12:53:26.031Z]     return func(*arg, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 1067, in query

[2024-01-09T12:53:26.031Z]     return conn.query(

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-09T12:53:26.031Z]     raise e from e

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-09T12:53:26.031Z]     return func(*args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-09T12:53:26.031Z]     return func(self, *args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-09T12:53:26.031Z]     raise e from e

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-09T12:53:26.031Z]     return func(*args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1419, in query

[2024-01-09T12:53:26.031Z]     check_status(response.status)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 57, in check_status

[2024-01-09T12:53:26.031Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-09T12:53:26.031Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>

[2024-01-09T12:53:26.031Z]  (api_request.py:45)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-09T12:53:26.031Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-09T12:53:26.031Z]     res = func(*args, **_kwargs)

[2024-01-09T12:53:26.031Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-09T12:53:26.031Z]     return func(*arg, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 792, in search

[2024-01-09T12:53:26.031Z]     resp = conn.search(

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-09T12:53:26.031Z]     raise e from e

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-09T12:53:26.031Z]     return func(*args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-09T12:53:26.031Z]     return func(self, *args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-09T12:53:26.031Z]     raise e from e

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-09T12:53:26.031Z]     return func(*args, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 783, in search

[2024-01-09T12:53:26.031Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 725, in _execute_search

[2024-01-09T12:53:26.031Z]     raise e from e

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 718, in _execute_search

[2024-01-09T12:53:26.031Z]     check_status(response.status)

[2024-01-09T12:53:26.031Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 57, in check_status

[2024-01-09T12:53:26.031Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-09T12:53:26.031Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>

[2024-01-09T12:53:26.031Z]  (api_request.py:45)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])> (api_request.py:46)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])> (api_request.py:46)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: insert request record cost 3.6461278796195984e-06s (checker.py:258)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: insert request record cost 2.505723387002945e-06s (checker.py:258)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:06.759569][time cost:0.06895207s][operation_name:query][collection name:Checker__ElB9y6dx] -> False (checker.py:261)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:06.759752][time cost:0.06891753s][operation_name:search][collection name:Checker__ElB9y6dx] -> False (checker.py:261)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request)  : [Collection.upsert] args: [[[-5250073737105031671, -7156249810310586555, 5742609856070236170, 2137084026310057013, 2389945198516488118, -8405276148448322549, -3621118634351277999, -2361572354135461881, -8573848698399142562, 7542015023255499063, 8184442585454993049, -2998748970170380672, 7794648042199478241, 41628430481198196......, kwargs: {} (api_request.py:62)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: insert request record cost 8.910894393920898e-06s (checker.py:258)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:07.072712][time cost:0.43570347s][operation_name:upsert_entities][collection name:Checker__ElB9y6dx] -> True (checker.py:261)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['int64 in [2245, 125, 498, 658, 1015]', None, None, 10], kwargs: {} (api_request.py:62)

[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.010202332970122314, 0.10174689438710105, 0.1439139058981663, 0.021525008594843686, 0.07430045836380449, 0.06294088779946858, 0.10487077252183924, 0.12413347529814714, 0.08727031361106562, 0.0841775004391493, 0.010932295839492335, 0.002869607244775029, 0.049372355285808654, 0.07323830014257975, ......, kwargs: {} (api_request.py:62)

[2024-01-09T12:53:26.032Z] [2024-01-09 12:53:07 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:07.831771', 'RPC error': '2024-01-09 12:53:07.835655'}> (decorators.py:134)

zhuwenxing avatar Jan 10 '24 02:01 zhuwenxing

This is also reproduced in the 2.3 branch pr: https://github.com/milvus-io/milvus/pull/29826 failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29826/1/pipeline/199

zhuwenxing avatar Jan 11 '24 02:01 zhuwenxing

Hi, @congqixia There are still some searches/queries that failed with errors in the master branch ci (I have already reduced the update api frequence)


[2024-01-18T10:32:55.358Z] [2024-01-18 10:32:40 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:

[2024-01-18T10:32:55.358Z] 	status = StatusCode.DEADLINE_EXCEEDED

[2024-01-18T10:32:55.358Z] 	details = "Deadline Exceeded"

[2024-01-18T10:32:55.358Z] 	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-01-18T10:32:40.818480976+00:00", grpc_status:4}"

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/14/pipeline

zhuwenxing avatar Jan 19 '24 02:01 zhuwenxing

When load, search, query, and upsert concurrently, milvus reboots and search/query reports an error.

image : master-20240118-ddccccbc-amd64

argo task : new-stable-master-1705600800 , id: 7 client pod: new-stable-master-1705600800-480296231 (qa namespace, 4am cluster) client error log:

[2024-01-18 20:01:17,588 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])>, <Time:{'RPC start': '2024-01-18 20:00:56.316724', 'RPC error': '2024-01-18 20:01:17.588806'}> (decorators.py:128)
[2024-01-18 20:01:17,590 - ERROR - fouram]: (api_response) : <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])> (api_request.py:54)
[2024-01-18 20:01:17,590 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])> (func_check.py:48)
[2024-01-18 20:01:17,591 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])>, <Time:{'RPC start': '2024-01-18 20:00:56.319906', 'RPC error': '2024-01-18 20:01:17.591156'}> (decorators.py:128)
[2024-01-18 20:01:17,591 - ERROR - fouram]: (api_response) : <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])> (api_request.py:54)
[2024-01-18 20:01:17,591 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])> (func_check.py:48)
[2024-01-18 20:01:17,591 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])>, <Time:{'RPC start': '2024-01-18 20:00:56.325522', 'RPC error': '2024-01-18 20:01:17.591717'}> (decorators.py:128)

server (milvus rebooted 7 times):

new-stable-mast00800-7-42-6968-etcd-0                             1/1     Running           0                2m23s   10.104.15.47    4am-node20   <none>           <none>
new-stable-mast00800-7-42-6968-milvus-standalone-7544ccfcfzlnzj   1/1     Running           0                2m23s   10.104.26.238   4am-node32   <none>           <none>
new-stable-mast00800-7-42-6968-minio-85795d85c-848jf              1/1     Running           0                2m23s   10.104.26.237   4am-node32   <none>           <none> (base.py:257)
[2024-01-18 23:10:27,598 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|new-stable-mast00800-7-42-6968-milvus|new-stable-mast00800-7-42-6968-minio|new-stable-mast00800-7-42-6968-etcd|new-stable-mast00800-7-42-6968-pulsar|new-stable-mast00800-7-42-6968-kafka'  (util_cmd.py:14)
[2024-01-18 23:10:36,946 -  INFO - fouram]: [CliClient] pod details of release(new-stable-mast00800-7-42-6968): 
 I0118 23:10:28.858835     523 request.go:665] Waited for 1.13933015s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/apiregistration.k8s.io/v1?timeout=32s
NAME                                                              READY   STATUS        RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
new-stable-mast00800-7-42-6968-etcd-0                             1/1     Running       0                5h8m    10.104.15.47    4am-node20   <none>           <none>
new-stable-mast00800-7-42-6968-milvus-standalone-7544ccfcfzlnzj   1/1     Running       7 (166m ago)     5h8m    10.104.26.238   4am-node32   <none>           <none>
new-stable-mast00800-7-42-6968-minio-85795d85c-848jf              1/1     Running       0                5h8m    10.104.26.237   4am-node32   <none>           <none> (cli_client.py:132)

test env: 4am, server : qa-milvus namespace

test steps:

1. create a collection  
  2. build an HNSW index on the vector column
  3. insert 1million vectors
  4. flush collection
  5. build index on vector column with the same parameters  
  6. count the total number of rows
  7. load collection
  8. execute concurrent search, query,load,upsert 
  9. step 8 lasts 5h

elstic avatar Jan 19 '24 02:01 elstic

There is still a 1/3 chance of failure failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/16/pipeline

zhuwenxing avatar Jan 22 '24 09:01 zhuwenxing

@congqixia it reproduce recently on pulsar +cluster pipeline, while it not reproduced on the other pipelines. could you please take a look? https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30346/6/pipeline

yanliang567 avatar Jan 31 '24 09:01 yanliang567

failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30406/1/pipeline/253

zhuwenxing avatar Jan 31 '24 09:01 zhuwenxing

still failed for cluster mode failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30780/2/pipeline/148

zhuwenxing avatar Feb 23 '24 06:02 zhuwenxing

/assign @MrPresent-Han

xiaofan-luan avatar Feb 23 '24 07:02 xiaofan-luan

please help on it

xiaofan-luan avatar Feb 23 '24 07:02 xiaofan-luan

some update: This issue has blocked my PR https://github.com/milvus-io/milvus/pull/30341 for almost a month. but today, it passed and was merged into the master.

however, it looks like not a stable pass, it can still fail with a high chance. I already found this blocked some open PR.

For example #30786 https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30786/1/pipeline/253

zhuwenxing avatar Feb 23 '24 10:02 zhuwenxing

Today's PR of mine also failed due to this concurrent test. but the error message changed to failed to search: service unavailable: all available nodes are unreachable

failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30780/3/pipeline/228

zhuwenxing avatar Feb 23 '24 10:02 zhuwenxing

I suppose the workload may be too high: search workload too high-->querynode has no cpu cycle for consuming tt-->query node tt lag a few minutes---search timeout

Metrics regarded matches the logical line above, please help verify whether increase cpu resources for query node can mitigate or resolve this problem or not

MrPresent-Han avatar Feb 27 '24 08:02 MrPresent-Han

According to the results provided by @zhuwenxing , we found that when disabling interim index on query node, the cpu usage of query node can be kept in a very mild range around 1.55 core and the whole query node was kept in a stable and solid status without any timeout failure.

I tried adjust the build pool size of knowhere to 1, which brought some effects and enable the concurrent case run smoothly

MrPresent-Han avatar Feb 29 '24 06:02 MrPresent-Han

@liliu-z @MrPresent-Han what is the default concurrency to build interim index ?

we could set to roundup(core/2)

xiaofan-luan avatar Feb 29 '24 12:02 xiaofan-luan

the growing index building should only take 1 thread. Checking

liliu-z avatar Feb 29 '24 12:02 liliu-z

@liliu-z @MrPresent-Han what is the default concurrency to build interim index ?

we could set to roundup(core/2)

yes, the new added parameter set the concurrency=cpuNum * 0.5

MrPresent-Han avatar Mar 11 '24 06:03 MrPresent-Han

/assign zhuwenxing

please help verify whether the effect of the pr above and see whether this issue can be closed or not?

MrPresent-Han avatar Mar 11 '24 06:03 MrPresent-Han

@zhuwenxing please help to update

yanliang567 avatar Mar 22 '24 09:03 yanliang567

with current resources, it is still reproduced. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-32046/1/pipeline/155/

[2024-04-09T04:25:11.480Z] [2024-04-09 04:23:34 - ERROR - pymilvus.decorators]: grpc RpcError: [hybrid_search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-04-09 04:23:24.967643', 'gRPC error': '2024-04-09 04:23:34.968932'}> (decorators.py:157)
[2024-04-09T04:25:11.480Z] [2024-04-09 04:23:34 - ERROR - ci_test]: Traceback (most recent call last):
[2024-04-09T04:25:11.480Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-04-09T04:25:11.480Z]     res = func(*args, **_kwargs)
[2024-04-09T04:25:11.480Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-04-09T04:25:11.480Z]     return func(*arg, **kwargs)
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 936, in hybrid_search
[2024-04-09T04:25:11.480Z]     resp = conn.hybrid_search(
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 161, in handler
[2024-04-09T04:25:11.480Z]     raise e from e
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 143, in handler
[2024-04-09T04:25:11.480Z]     return func(*args, **kwargs)
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 182, in handler
[2024-04-09T04:25:11.480Z]     return func(self, *args, **kwargs)
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
[2024-04-09T04:25:11.480Z]     raise e from e
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 87, in handler
[2024-04-09T04:25:11.480Z]     return func(*args, **kwargs)
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 843, in hybrid_search
[2024-04-09T04:25:11.480Z]     return self._execute_hybrid_search(
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 754, in _execute_hybrid_search
[2024-04-09T04:25:11.480Z]     raise e from e
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 746, in _execute_hybrid_search
[2024-04-09T04:25:11.480Z]     response = self._stub.HybridSearch(request, timeout=timeout)
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 277, in __call__
[2024-04-09T04:25:11.480Z]     response, ignored_call = self._with_call(
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 332, in _with_call
[2024-04-09T04:25:11.480Z]     return call.result(), call
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 438, in result
[2024-04-09T04:25:11.480Z]     raise self
[2024-04-09T04:25:11.480Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 315, in continuation
[2024-04-09T04:25:11.480Z]     response, call = self._thunk(new_method).with_call(
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 343, in with_call
[2024-04-09T04:25:11.481Z]     return self._with_call(
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 332, in _with_call
[2024-04-09T04:25:11.481Z]     return call.result(), call
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 438, in result
[2024-04-09T04:25:11.481Z]     raise self
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 315, in continuation
[2024-04-09T04:25:11.481Z]     response, call = self._thunk(new_method).with_call(
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 1178, in with_call
[2024-04-09T04:25:11.481Z]     return _end_unary_response_blocking(state, call, True, None)
[2024-04-09T04:25:11.481Z]   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
[2024-04-09T04:25:11.481Z]     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
[2024-04-09T04:25:11.481Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2024-04-09T04:25:11.481Z] 	status = StatusCode.DEADLINE_EXCEEDED
[2024-04-09T04:25:11.481Z] 	details = "Deadline Exceeded"
[2024-04-09T04:25:11.481Z] 	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-04-09T04:23:34.968570011+00:00", grpc_status:4}"
[2024-04-09T04:25:11.481Z] >
[2024-04-09T04:25:11.481Z]  (api_request.py:45)

zhuwenxing avatar Apr 09 '24 08:04 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 10 '24 06:06 stale[bot]