[Bug]: [perf][cluster]Milvus concurrent search with error "StatusCode.DEADLINE_EXCEEDED"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:master-20230328-dc6d4b91
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):2.3.0.dev45
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
release_name_prefix: perf-cluster-master-1680057000
deploy_config: fouramf-server-cluster-8c16m
case_params: fouramf-client-gist1m-concurrent-hnsw
other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster
case_name: test_concurrent_locust_custom_parameters
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
perf-cluster-ma57000-2-14-5118-etcd-0 1/1 Running 0 4h26m 10.104.5.128 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-etcd-1 1/1 Running 0 4h26m 10.104.6.6 4am-node13 <none> <none>
perf-cluster-ma57000-2-14-5118-etcd-2 1/1 Running 0 4h26m 10.104.4.213 4am-node11 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-datacoord-7786b55d67259nw 1/1 Running 2 (4h18m ago) 4h26m 10.104.12.239 4am-node17 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-datanode-7c55976cb4-8hsst 1/1 Running 2 (4h18m ago) 4h26m 10.104.1.94 4am-node10 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-indexcoord-74fb69d6ftb24j 1/1 Running 0 4h26m 10.104.14.57 4am-node18 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-indexnode-68d84f7d8ff2vdv 1/1 Running 0 4h26m 10.104.12.237 4am-node17 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-proxy-6f9d6f76-sbsd2 1/1 Running 2 (4h18m ago) 4h26m 10.104.12.238 4am-node17 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-querycoord-77bcd557dmk2r7 1/1 Running 2 (4h18m ago) 4h26m 10.104.12.241 4am-node17 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-querynode-6c7cff9c47z6crh 1/1 Running 0 4h26m 10.104.13.217 4am-node16 <none> <none>
perf-cluster-ma57000-2-14-5118-milvus-rootcoord-888487d65-mzkh8 1/1 Running 2 (4h18m ago) 4h26m 10.104.12.240 4am-node17 <none> <none>
perf-cluster-ma57000-2-14-5118-minio-0 1/1 Running 0 4h26m 10.104.9.63 4am-node14 <none> <none>
perf-cluster-ma57000-2-14-5118-minio-1 1/1 Running 0 4h26m 10.104.5.127 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-minio-2 1/1 Running 0 4h26m 10.104.6.5 4am-node13 <none> <none>
perf-cluster-ma57000-2-14-5118-minio-3 1/1 Running 0 4h26m 10.104.4.211 4am-node11 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-0 1/1 Running 0 4h26m 10.104.9.69 4am-node14 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-1 1/1 Running 0 4h26m 10.104.6.8 4am-node13 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-2 1/1 Running 0 4h26m 10.104.1.120 4am-node10 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-init-w458g 0/1 Completed 0 4h26m 10.104.5.106 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-broker-0 1/1 Running 0 4h26m 10.104.5.110 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-proxy-0 1/1 Running 0 4h26m 10.104.5.109 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-pulsar-init-ckshc 0/1 Completed 0 4h26m 10.104.5.107 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-recovery-0 1/1 Running 0 4h26m 10.104.5.105 4am-node12 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-0 1/1 Running 0 4h26m 10.104.9.66 4am-node14 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-1 1/1 Running 0 4h24m 10.104.1.128 4am-node10 <none> <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-2 1/1 Running 0 4h22m 10.104.5.156 4am-node12 <none> <none>
client:
[2023-03-29 03:04:22,463 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)>, <Time:{'RPC start': '2023-03-29 03:03:22.461514', 'RPC error': '2023-03-29 03:04:22.463570'}> (decorators.py:108)
[2023-03-29 03:04:22,465 - ERROR - fouram]: Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
raise pre_err
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 475, in _execute_search_requests
response = self._stub.Search(request, timeout=timeout)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
res = func(*args, **kwargs)
File "/src/fouram/client/util/api_request.py", line 70, in api_request
return func(*arg, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
res = conn.search(self._name, data, anns_field, param, limit, expr,
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 58, in handler
raise MilvusException(message=str(e)) from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)>
(api_request.py:48)
[2023-03-29 03:04:22,465 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)> (api_request.py:49)
[2023-03-29 03:04:22,465 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)> (func_check.py:43)
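The one-minute gap between 'RPC start' (03:03:22) and 'RPC error' (03:04:22) suggests the search call hit a roughly 60-second client-side gRPC deadline. For reference only (this is not the fouram client's code), pymilvus allows a per-call timeout to be passed to search; a minimal sketch, assuming a hypothetical, already-loaded collection gist1m_repro with a 768-dim FLOAT_VECTOR field named float_vector:

from pymilvus import connections, Collection

# Hypothetical names for illustration; the collection is assumed to already
# exist, be loaded, and have a 768-dim FLOAT_VECTOR field "float_vector".
connections.connect(host="127.0.0.1", port="19530")
collection = Collection("gist1m_repro")

# Pass an explicit per-call timeout (in seconds) larger than the ~60 s
# deadline seen in the log above.
results = collection.search(
    data=[[0.0] * 768],
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"ef": 16}},
    limit=1,
    timeout=120,
)
print(results)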
Expected Behavior
No response
Steps To Reproduce
1. create a collection
2. build hnsw index on vector column
3. insert 1m vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. concurrent search
9. clean all collections (optional); a minimal pymilvus sketch of these steps follows
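A minimal pymilvus sketch of the steps above, assuming a local Milvus endpoint and a hypothetical collection gist1m_repro with a 768-dim vector field named float_vector. The real test drives this through the fouram/Locust framework and uses the gist dataset rather than random data:

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType, utility
)

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

# 1. create a collection (dim=768 to match the configured gist1m dataset)
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("gist1m_repro", CollectionSchema(fields, auto_id=True))

# 2. build an HNSW index on the vector column (M=8, efConstruction=200)
index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index_params)

# 3./4. insert 1M vectors in batches of 1000 (random stand-in data), then flush
for _ in range(1000):
    collection.insert([np.random.random((1000, 768)).tolist()])
collection.flush()

# 5. build the index again with the same parameters, 6. count rows, 7. load
collection.create_index("float_vector", index_params)
print(collection.num_entities)
collection.load()

# 8. concurrent search: the real test runs this through Locust for 3600 s;
#    here a single loop stands in for one concurrent worker
for _ in range(100):
    collection.search(
        data=np.random.random((1, 768)).tolist(),
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {"ef": 16}},
        limit=1,
    )

# 9. clean up (optional)
utility.drop_collection("gist1m_repro")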
Milvus Log
No response
Anything else?
'client': {'test_case_type': 'ConcurrentClientBase',
'test_case_name': 'test_concurrent_locust_custom_parameters',
'test_case_params': {'dataset_params': {'dim': 768,
'dataset_name': 'gist',
'dataset_size': '1m',
'ni_per': 1000,
'metric_type': 'L2'},
'collection_params': {'other_fields': []},
'load_params': {},
'search_params': {},
'index_params': {'index_type': 'HNSW',
'index_param': {'M': 8,
'efConstruction': 200}},
'concurrent_params': {'concurrent_number': 1,
'during_time': 3600,
'interval': 20,
'spawn_rate': None},
'concurrent_tasks': [{'type': 'search',
'weight': 1,
'params': {'nq': 1,
'top_k': 1,
'search_param': {'ef': 16},
'random_data': True}}]},
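Read together, these parameters mean a single concurrent worker issuing nq=1 / top_k=1 HNSW searches (ef=16) with random query vectors for 3600 s, reporting every 20 s. A rough thread-based equivalent is sketched below; the actual client uses Locust, and the collection and field names are placeholders:

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint
collection = Collection("gist1m_repro")               # assumed, already loaded

def search_task(duration_s: float = 3600.0) -> None:
    """One worker issuing nq=1, top_k=1 searches with random query vectors."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        collection.search(
            data=np.random.random((1, 768)).tolist(),  # nq=1, random_data=True
            anns_field="float_vector",                 # assumed field name
            param={"metric_type": "L2", "params": {"ef": 16}},
            limit=1,                                   # top_k=1
        )

# concurrent_number=1 -> a single worker thread
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(search_task).result()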
/assign @jiaoew1991 /unassign
/assign @XuanYang-cn /unassign
release_name_prefix: perf-cluster-master-1680652800
deploy_config: fouramf-server-cluster-8c16m
case_params: fouramf-client-gist1m-concurrent-hnsw
other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster
case_name: test_concurrent_locust_custom_parameters
server:
perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-2 1/1 Running 0 4h17m 10.104.5.51 4am-node12 <none> <none> (cli_client.py:130)
perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-1 1/1 Running 0 4h18m 10.104.9.129 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-0 1/1 Running 0 4h21m 10.104.1.184 4am-node10 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-recovery-0 1/1 Running 0 4h21m 10.104.4.51 4am-node11 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-pulsar-init-cvsdv 0/1 Completed 0 4h21m 10.104.9.65 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-proxy-0 1/1 Running 0 4h21m 10.104.9.66 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-broker-0 1/1 Running 0 4h21m 10.104.6.228 4am-node13 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-init-52rgc 0/1 Completed 0 4h21m 10.104.9.64 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-2 1/1 Running 0 4h21m 10.104.6.11 4am-node13 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-1 1/1 Running 0 4h21m 10.104.1.190 4am-node10 <none> <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-0 1/1 Running 0 4h21m 10.104.5.36 4am-node12 <none> <none>
perf-cluster-ma52800-2-96-4095-minio-3 1/1 Running 0 4h21m 10.104.1.192 4am-node10 <none> <none>
perf-cluster-ma52800-2-96-4095-minio-2 1/1 Running 0 4h21m 10.104.4.82 4am-node11 <none> <none>
perf-cluster-ma52800-2-96-4095-minio-1 1/1 Running 0 4h21m 10.104.9.122 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-minio-0 1/1 Running 0 4h21m 10.104.5.38 4am-node12 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-rootcoord-5f857cf9c9x9v8h 1/1 Running 1 (4h17m ago) 4h21m 10.104.12.77 4am-node17 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-querynode-88659d889-8lzqq 1/1 Running 36 (7m31s ago) 4h21m 10.104.13.70 4am-node16 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-querycoord-5579f65cf2jn4j 1/1 Running 1 (4h17m ago) 4h21m 10.104.12.74 4am-node17 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-proxy-76bb4f9bb-5v42b 1/1 Running 1 (4h17m ago) 4h21m 10.104.5.20 4am-node12 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-indexnode-5584fb8695fzgwp 1/1 Running 1 (4h17m ago) 4h21m 10.104.14.202 4am-node18 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-indexcoord-77b64c69bknl5b 1/1 Running 0 4h21m 10.104.12.76 4am-node17 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-datanode-7556567d68-bs29t 1/1 Running 1 (4h17m ago) 4h21m 10.104.12.75 4am-node17 <none> <none>
perf-cluster-ma52800-2-96-4095-milvus-datacoord-b4877ddf9-r8pbn 1/1 Running 1 (4h17m ago) 4h21m 10.104.12.73 4am-node17 <none> <none>
perf-cluster-ma52800-2-96-4095-etcd-2 1/1 Running 0 4h21m 10.104.9.123 4am-node14 <none> <none>
perf-cluster-ma52800-2-96-4095-etcd-1 1/1 Running 0 4h21m 10.104.4.81 4am-node11 <none> <none>
perf-cluster-ma52800-2-96-4095-etcd-0 1/1 Running 0 4h21m 10.104.6.8 4am-node13 <none> <none>
client:
[2023-04-05 04:21:40,562 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)> (func_check.py:43)
[2023-04-05 04:21:40,562 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)> (api_request.py:49)
(api_request.py:48)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)>
QueryNode OOMed and cannot recover properly.
Using 16 GB of memory to load 6 GB of data does not seem to make sense.
Why not? HNSW should take at most 2 times the original data size.
The 6 GB figure was calculated with the HNSW index sizing tool, which already doubles the original data size.
Reproduced locally: loading 5.9 GB of data in total took about 18 GB of memory at peak.
@jingkl According to @yah01, we now need about 3 times the data size in memory to load indexes, so 6 GB of data might need 18 GB of memory in total.
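As a rough sanity check of those numbers (assuming the gist1m config in this issue: 1M vectors, dim=768, float32), the back-of-the-envelope arithmetic below reproduces the ~6 GB index size and the ~18 GB loading peak discussed above:

# Back-of-the-envelope memory estimate for the gist1m config in this issue.
num_vectors = 1_000_000
dim = 768
bytes_per_float = 4

raw_gb = num_vectors * dim * bytes_per_float / 1e9   # ~3.1 GB of raw vectors
hnsw_index_gb = 2 * raw_gb                            # HNSW ~2x raw data -> ~6 GB
load_peak_gb = 3 * hnsw_index_gb                      # ~3x the data size while loading -> ~18 GB

print(f"raw: {raw_gb:.1f} GB, index: {hnsw_index_gb:.1f} GB, load peak: {load_peak_gb:.1f} GB")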
There're ongoing PRs to fix this: #22984, #23138
Just added the config queryNode.ioPoolSize; the default size is cpuNum * 8, with a minimum of 32 and a maximum of 256. A lower size consumes less memory.
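A small sketch of how that default would be derived (illustrative only; the actual clamping is done inside Milvus, and the function name here is hypothetical):

import os
from typing import Optional

def default_io_pool_size(cpu_num: Optional[int] = None) -> int:
    """queryNode.ioPoolSize default: cpuNum * 8, clamped to [32, 256]."""
    if cpu_num is None:
        cpu_num = os.cpu_count() or 1
    return min(max(cpu_num * 8, 32), 256)

print(default_io_pool_size(8))   # 64
print(default_io_pool_size(2))   # 32 (floor)
print(default_io_pool_size(64))  # 256 (ceiling)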
xi-ge16 is working on reducing the memory usage of Arrow; that is the main reason loading is slow and consumes so much memory.
/unassign /assign @jingkl
release_name_prefix: perf-standalone-1683721800
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-gist1m-concurrent-hnsw
other_params: -s --client_version=2.4.0 --update_helm_file master-20230510-0c99399f
This issue does not appear; closing the issue.