milvus
milvus copied to clipboard
[Bug]: Search and query failed with the error `_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED` when concurrent insert/upsert/delete/search/query requests are sent
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:PR-29424-20240103-74ad0c040
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - DEBUG - ci_test]: (api_request) : [Collection.query] args: ['int64 in [842, 207, 326, 238, 1924]', None, None, 10], kwargs: {} (api_request.py:62)
[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])>, <Time:{'RPC start': '2024-01-03 09:39:15.154966', 'RPC error': '2024-01-03 09:39:15.178422'}> (decorators.py:128)
[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - ci_test]: Traceback (most recent call last):
[2024-01-03T09:39:38.084Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-01-03T09:39:38.084Z] res = func(*args, **_kwargs)
[2024-01-03T09:39:38.084Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-01-03T09:39:38.084Z] return func(*arg, **kwargs)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 915, in query
[2024-01-03T09:39:38.084Z] return conn.query(
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 129, in handler
[2024-01-03T09:39:38.084Z] raise e from e
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 125, in handler
[2024-01-03T09:39:38.084Z] return func(*args, **kwargs)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 164, in handler
[2024-01-03T09:39:38.084Z] return func(self, *args, **kwargs)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 104, in handler
[2024-01-03T09:39:38.084Z] raise e from e
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 68, in handler
[2024-01-03T09:39:38.084Z] return func(*args, **kwargs)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1314, in query
[2024-01-03T09:39:38.084Z] check_status(response.status)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 54, in check_status
[2024-01-03T09:39:38.084Z] raise MilvusException(status.code, status.reason, status.error_code)
[2024-01-03T09:39:38.084Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])>
[2024-01-03T09:39:38.084Z] (api_request.py:45)
[2024-01-03T09:39:38.084Z] [2024-01-03 09:39:15 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_15_446764440316416579v0])> (api_request.py:46)
[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.10028257037807405, 0.11010703908291551, 0.1392804370425213, 0.1043137217595161, 0.06713751283051579, 0.012174845508086932, 0.02535379578369703, 0.011733175588496571, 0.023539246659198022, 0.12330800922598613, 0.041244465056191115, 0.08051328049227138, 0.07585501537845386, 0.061646895741733196, ......, kwargs: {} (api_request.py:62)
[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_0_446764440316416579v1])>, <Time:{'RPC start': '2024-01-03 09:39:14.625148', 'RPC error': '2024-01-03 09:39:14.628038'}> (decorators.py:128)
[2024-01-03T09:39:38.083Z] [2024-01-03 09:39:14 - ERROR - ci_test]: Traceback (most recent call last):
[2024-01-03T09:39:38.083Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-01-03T09:39:38.083Z] res = func(*args, **_kwargs)
[2024-01-03T09:39:38.083Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-01-03T09:39:38.083Z] return func(*arg, **kwargs)
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 782, in search
[2024-01-03T09:39:38.083Z] resp = conn.search(
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 129, in handler
[2024-01-03T09:39:38.083Z] raise e from e
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 125, in handler
[2024-01-03T09:39:38.083Z] return func(*args, **kwargs)
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 164, in handler
[2024-01-03T09:39:38.083Z] return func(self, *args, **kwargs)
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 104, in handler
[2024-01-03T09:39:38.083Z] raise e from e
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 68, in handler
[2024-01-03T09:39:38.083Z] return func(*args, **kwargs)
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 748, in search
[2024-01-03T09:39:38.083Z] return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)
[2024-01-03T09:39:38.083Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 709, in _execute_search
[2024-01-03T09:39:38.083Z] raise e from e
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 702, in _execute_search
[2024-01-03T09:39:38.084Z] check_status(response.status)
[2024-01-03T09:39:38.084Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 54, in check_status
[2024-01-03T09:39:38.084Z] raise MilvusException(status.code, status.reason, status.error_code)
[2024-01-03T09:39:38.084Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-03 09:39:04.031987856 +0000 UTC: node offline[node=3]: channel not available[channel=by-dev-rootcoord-dml_0_446764440316416579v1])>
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed pipeline: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29424/8/pipeline
Anything else?
I don't see pods restarted for the Milvus release
/assign @congqixia /unassign
@congqixia I have rebased the latest commit in the master branch, and the ci still failed. failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29424/14/pipeline/26
[2024-01-05T09:14:11.333Z] [2024-01-05 09:13:55 - ERROR - pymilvus.decorators]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-05 09:13:45.928331', 'gRPC error': '2024-01-05 09:13:55.930154'}> (decorators.py:145)
[2024-01-05T09:14:11.333Z] [2024-01-05 09:13:55 - ERROR - pymilvus.decorators]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-05 09:13:45.928608', 'gRPC error': '2024-01-05 09:13:55.930525'}> (decorators.py:145)
failed job:https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/2/pipeline/199
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.027154544477948565, 0.12520229297895016, 0.05025134566586025, 0.06978392276226772, 0.035489844523688835, 0.09389808285370721, 0.02850963396626388, 0.02536066202838759, 0.14153427606519733, 0.024387341933807778, 0.015147580294618384, 0.11688336802589201, 0.029718463651159828, 0.01661791774801437......, kwargs: {} (api_request.py:62)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:06.766110', 'RPC error': '2024-01-09 12:53:06.802371'}> (decorators.py:134)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:06.771545', 'RPC error': '2024-01-09 12:53:06.807714'}> (decorators.py:134)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: Traceback (most recent call last):
[2024-01-09T12:53:26.031Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-01-09T12:53:26.031Z] res = func(*args, **_kwargs)
[2024-01-09T12:53:26.031Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-01-09T12:53:26.031Z] return func(*arg, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 1067, in query
[2024-01-09T12:53:26.031Z] return conn.query(
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
[2024-01-09T12:53:26.031Z] raise e from e
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
[2024-01-09T12:53:26.031Z] return func(*args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
[2024-01-09T12:53:26.031Z] return func(self, *args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
[2024-01-09T12:53:26.031Z] raise e from e
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
[2024-01-09T12:53:26.031Z] return func(*args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1419, in query
[2024-01-09T12:53:26.031Z] check_status(response.status)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 57, in check_status
[2024-01-09T12:53:26.031Z] raise MilvusException(status.code, status.reason, status.error_code)
[2024-01-09T12:53:26.031Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>
[2024-01-09T12:53:26.031Z] (api_request.py:45)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: Traceback (most recent call last):
[2024-01-09T12:53:26.031Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-01-09T12:53:26.031Z] res = func(*args, **_kwargs)
[2024-01-09T12:53:26.031Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-01-09T12:53:26.031Z] return func(*arg, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 792, in search
[2024-01-09T12:53:26.031Z] resp = conn.search(
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
[2024-01-09T12:53:26.031Z] raise e from e
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
[2024-01-09T12:53:26.031Z] return func(*args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
[2024-01-09T12:53:26.031Z] return func(self, *args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
[2024-01-09T12:53:26.031Z] raise e from e
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
[2024-01-09T12:53:26.031Z] return func(*args, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 783, in search
[2024-01-09T12:53:26.031Z] return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 725, in _execute_search
[2024-01-09T12:53:26.031Z] raise e from e
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 718, in _execute_search
[2024-01-09T12:53:26.031Z] check_status(response.status)
[2024-01-09T12:53:26.031Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 57, in check_status
[2024-01-09T12:53:26.031Z] raise MilvusException(status.code, status.reason, status.error_code)
[2024-01-09T12:53:26.031Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>
[2024-01-09T12:53:26.031Z] (api_request.py:45)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])> (api_request.py:46)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to search: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])> (api_request.py:46)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: insert request record cost 3.6461278796195984e-06s (checker.py:258)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: insert request record cost 2.505723387002945e-06s (checker.py:258)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:06.759569][time cost:0.06895207s][operation_name:query][collection name:Checker__ElB9y6dx] -> False (checker.py:261)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:06 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:06.759752][time cost:0.06891753s][operation_name:search][collection name:Checker__ElB9y6dx] -> False (checker.py:261)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request) : [Collection.upsert] args: [[[-5250073737105031671, -7156249810310586555, 5742609856070236170, 2137084026310057013, 2389945198516488118, -8405276148448322549, -3621118634351277999, -2361572354135461881, -8573848698399142562, 7542015023255499063, 8184442585454993049, -2998748970170380672, 7794648042199478241, 41628430481198196......, kwargs: {} (api_request.py:62)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: insert request record cost 8.910894393920898e-06s (checker.py:258)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: [test][start time:2024-01-09 12:53:07.072712][time cost:0.43570347s][operation_name:upsert_entities][collection name:Checker__ElB9y6dx] -> True (checker.py:261)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request) : [Collection.query] args: ['int64 in [2245, 125, 498, 658, 1015]', None, None, 10], kwargs: {} (api_request.py:62)
[2024-01-09T12:53:26.031Z] [2024-01-09 12:53:07 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.010202332970122314, 0.10174689438710105, 0.1439139058981663, 0.021525008594843686, 0.07430045836380449, 0.06294088779946858, 0.10487077252183924, 0.12413347529814714, 0.08727031361106562, 0.0841775004391493, 0.010932295839492335, 0.002869607244775029, 0.049372355285808654, 0.07323830014257975, ......, kwargs: {} (api_request.py:62)
[2024-01-09T12:53:26.032Z] [2024-01-09 12:53:07 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: leader not available: lastHB=2024-01-09 12:52:55.136945697 +0000 UTC: node offline[node=1]: channel not available[channel=by-dev-rootcoord-dml_8_446903399438225462v0])>, <Time:{'RPC start': '2024-01-09 12:53:07.831771', 'RPC error': '2024-01-09 12:53:07.835655'}> (decorators.py:134)
This is also reproduced in the 2.3 branch pr: https://github.com/milvus-io/milvus/pull/29826 failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29826/1/pipeline/199
Hi, @congqixia There are still some searches/queries that failed with errors in the master branch ci (I have already reduced the update api frequence)
[2024-01-18T10:32:55.358Z] [2024-01-18 10:32:40 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:
[2024-01-18T10:32:55.358Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-01-18T10:32:55.358Z] details = "Deadline Exceeded"
[2024-01-18T10:32:55.358Z] debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-01-18T10:32:40.818480976+00:00", grpc_status:4}"
failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/14/pipeline
When load, search, query, and upsert concurrently, milvus reboots and search/query reports an error.
image : master-20240118-ddccccbc-amd64
argo task : new-stable-master-1705600800 , id: 7 client pod: new-stable-master-1705600800-480296231 (qa namespace, 4am cluster) client error log:
[2024-01-18 20:01:17,588 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])>, <Time:{'RPC start': '2024-01-18 20:00:56.316724', 'RPC error': '2024-01-18 20:01:17.588806'}> (decorators.py:128)
[2024-01-18 20:01:17,590 - ERROR - fouram]: (api_response) : <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])> (api_request.py:54)
[2024-01-18 20:01:17,590 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])> (func_check.py:48)
[2024-01-18 20:01:17,591 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])>, <Time:{'RPC start': '2024-01-18 20:00:56.319906', 'RPC error': '2024-01-18 20:01:17.591156'}> (decorators.py:128)
[2024-01-18 20:01:17,591 - ERROR - fouram]: (api_response) : <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])> (api_request.py:54)
[2024-01-18 20:01:17,591 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_447113067173773664v0])> (func_check.py:48)
[2024-01-18 20:01:17,591 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_1_447113067173773664v1])>, <Time:{'RPC start': '2024-01-18 20:00:56.325522', 'RPC error': '2024-01-18 20:01:17.591717'}> (decorators.py:128)
server (milvus rebooted 7 times):
new-stable-mast00800-7-42-6968-etcd-0 1/1 Running 0 2m23s 10.104.15.47 4am-node20 <none> <none>
new-stable-mast00800-7-42-6968-milvus-standalone-7544ccfcfzlnzj 1/1 Running 0 2m23s 10.104.26.238 4am-node32 <none> <none>
new-stable-mast00800-7-42-6968-minio-85795d85c-848jf 1/1 Running 0 2m23s 10.104.26.237 4am-node32 <none> <none> (base.py:257)
[2024-01-18 23:10:27,598 - INFO - fouram]: [Cmd Exe] kubectl get pods -n qa-milvus -o wide | grep -E 'STATUS|new-stable-mast00800-7-42-6968-milvus|new-stable-mast00800-7-42-6968-minio|new-stable-mast00800-7-42-6968-etcd|new-stable-mast00800-7-42-6968-pulsar|new-stable-mast00800-7-42-6968-kafka' (util_cmd.py:14)
[2024-01-18 23:10:36,946 - INFO - fouram]: [CliClient] pod details of release(new-stable-mast00800-7-42-6968):
I0118 23:10:28.858835 523 request.go:665] Waited for 1.13933015s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/apiregistration.k8s.io/v1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
new-stable-mast00800-7-42-6968-etcd-0 1/1 Running 0 5h8m 10.104.15.47 4am-node20 <none> <none>
new-stable-mast00800-7-42-6968-milvus-standalone-7544ccfcfzlnzj 1/1 Running 7 (166m ago) 5h8m 10.104.26.238 4am-node32 <none> <none>
new-stable-mast00800-7-42-6968-minio-85795d85c-848jf 1/1 Running 0 5h8m 10.104.26.237 4am-node32 <none> <none> (cli_client.py:132)
test env: 4am, server : qa-milvus namespace
test steps:
1. create a collection
2. build an HNSW index on the vector column
3. insert 1million vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. execute concurrent search, query,load,upsert
9. step 8 lasts 5h
There is still a 1/3 chance of failure failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-29811/16/pipeline
@congqixia it reproduce recently on pulsar +cluster pipeline, while it not reproduced on the other pipelines. could you please take a look? https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30346/6/pipeline
failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30406/1/pipeline/253
still failed for cluster mode failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30780/2/pipeline/148
/assign @MrPresent-Han
please help on it
some update: This issue has blocked my PR https://github.com/milvus-io/milvus/pull/30341 for almost a month. but today, it passed and was merged into the master.
however, it looks like not a stable pass, it can still fail with a high chance. I already found this blocked some open PR.
For example #30786 https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30786/1/pipeline/253
Today's PR of mine also failed due to this concurrent test.
but the error message changed to failed to search: service unavailable: all available nodes are unreachable
failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-30780/3/pipeline/228
I suppose the workload may be too high: search workload too high-->querynode has no cpu cycle for consuming tt-->query node tt lag a few minutes---search timeout
Metrics regarded matches the logical line above, please help verify whether increase cpu resources for query node can mitigate or resolve this problem or not
According to the results provided by @zhuwenxing , we found that when disabling interim index on query node, the cpu usage of query node can be kept in a very mild range around 1.55 core and the whole query node was kept in a stable and solid status without any timeout failure.
I tried adjust the build pool size of knowhere to 1, which brought some effects and enable the concurrent case run smoothly
@liliu-z @MrPresent-Han what is the default concurrency to build interim index ?
we could set to roundup(core/2)
the growing index building should only take 1 thread. Checking
@liliu-z @MrPresent-Han what is the default concurrency to build interim index ?
we could set to roundup(core/2)
yes, the new added parameter set the concurrency=cpuNum * 0.5
/assign zhuwenxing
please help verify whether the effect of the pr above and see whether this issue can be closed or not?
@zhuwenxing please help to update
with current resources, it is still reproduced. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-32046/1/pipeline/155/
[2024-04-09T04:25:11.480Z] [2024-04-09 04:23:34 - ERROR - pymilvus.decorators]: grpc RpcError: [hybrid_search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-04-09 04:23:24.967643', 'gRPC error': '2024-04-09 04:23:34.968932'}> (decorators.py:157)
[2024-04-09T04:25:11.480Z] [2024-04-09 04:23:34 - ERROR - ci_test]: Traceback (most recent call last):
[2024-04-09T04:25:11.480Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-04-09T04:25:11.480Z] res = func(*args, **_kwargs)
[2024-04-09T04:25:11.480Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-04-09T04:25:11.480Z] return func(*arg, **kwargs)
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 936, in hybrid_search
[2024-04-09T04:25:11.480Z] resp = conn.hybrid_search(
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 161, in handler
[2024-04-09T04:25:11.480Z] raise e from e
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 143, in handler
[2024-04-09T04:25:11.480Z] return func(*args, **kwargs)
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 182, in handler
[2024-04-09T04:25:11.480Z] return func(self, *args, **kwargs)
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 91, in handler
[2024-04-09T04:25:11.480Z] raise e from e
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 87, in handler
[2024-04-09T04:25:11.480Z] return func(*args, **kwargs)
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 843, in hybrid_search
[2024-04-09T04:25:11.480Z] return self._execute_hybrid_search(
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 754, in _execute_hybrid_search
[2024-04-09T04:25:11.480Z] raise e from e
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 746, in _execute_hybrid_search
[2024-04-09T04:25:11.480Z] response = self._stub.HybridSearch(request, timeout=timeout)
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 277, in __call__
[2024-04-09T04:25:11.480Z] response, ignored_call = self._with_call(
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 332, in _with_call
[2024-04-09T04:25:11.480Z] return call.result(), call
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 438, in result
[2024-04-09T04:25:11.480Z] raise self
[2024-04-09T04:25:11.480Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 315, in continuation
[2024-04-09T04:25:11.480Z] response, call = self._thunk(new_method).with_call(
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 343, in with_call
[2024-04-09T04:25:11.481Z] return self._with_call(
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 332, in _with_call
[2024-04-09T04:25:11.481Z] return call.result(), call
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 438, in result
[2024-04-09T04:25:11.481Z] raise self
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_interceptor.py", line 315, in continuation
[2024-04-09T04:25:11.481Z] response, call = self._thunk(new_method).with_call(
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 1178, in with_call
[2024-04-09T04:25:11.481Z] return _end_unary_response_blocking(state, call, True, None)
[2024-04-09T04:25:11.481Z] File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
[2024-04-09T04:25:11.481Z] raise _InactiveRpcError(state) # pytype: disable=not-instantiable
[2024-04-09T04:25:11.481Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2024-04-09T04:25:11.481Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-04-09T04:25:11.481Z] details = "Deadline Exceeded"
[2024-04-09T04:25:11.481Z] debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-04-09T04:23:34.968570011+00:00", grpc_status:4}"
[2024-04-09T04:25:11.481Z] >
[2024-04-09T04:25:11.481Z] (api_request.py:45)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
.