
[Bug]: [perf][cluster]Milvus concurrent search with error "StatusCode.DEADLINE_EXCEEDED"

Open jingkl opened this issue 1 year ago • 12 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:master-20230328-dc6d4b91
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):2.3.0.dev45
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

release_name_prefix: perf-cluster-master-1680057000 deploy_config: fouramf-server-cluster-8c16m case_params: fouramf-client-gist1m-concurrent-hnsw other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster case_name: test_concurrent_locust_custom_parameters

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-cluster-ma57000-2-14-5118-etcd-0                             1/1     Running     0                4h26m   10.104.5.128    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-etcd-1                             1/1     Running     0                4h26m   10.104.6.6      4am-node13   <none>           <none>
perf-cluster-ma57000-2-14-5118-etcd-2                             1/1     Running     0                4h26m   10.104.4.213    4am-node11   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-datacoord-7786b55d67259nw   1/1     Running     2 (4h18m ago)    4h26m   10.104.12.239   4am-node17   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-datanode-7c55976cb4-8hsst   1/1     Running     2 (4h18m ago)    4h26m   10.104.1.94     4am-node10   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-indexcoord-74fb69d6ftb24j   1/1     Running     0                4h26m   10.104.14.57    4am-node18   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-indexnode-68d84f7d8ff2vdv   1/1     Running     0                4h26m   10.104.12.237   4am-node17   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-proxy-6f9d6f76-sbsd2        1/1     Running     2 (4h18m ago)    4h26m   10.104.12.238   4am-node17   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-querycoord-77bcd557dmk2r7   1/1     Running     2 (4h18m ago)    4h26m   10.104.12.241   4am-node17   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-querynode-6c7cff9c47z6crh   1/1     Running     0                4h26m   10.104.13.217   4am-node16   <none>           <none>
perf-cluster-ma57000-2-14-5118-milvus-rootcoord-888487d65-mzkh8   1/1     Running     2 (4h18m ago)    4h26m   10.104.12.240   4am-node17   <none>           <none>
perf-cluster-ma57000-2-14-5118-minio-0                            1/1     Running     0                4h26m   10.104.9.63     4am-node14   <none>           <none>
perf-cluster-ma57000-2-14-5118-minio-1                            1/1     Running     0                4h26m   10.104.5.127    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-minio-2                            1/1     Running     0                4h26m   10.104.6.5      4am-node13   <none>           <none>
perf-cluster-ma57000-2-14-5118-minio-3                            1/1     Running     0                4h26m   10.104.4.211    4am-node11   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-0                    1/1     Running     0                4h26m   10.104.9.69     4am-node14   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-1                    1/1     Running     0                4h26m   10.104.6.8      4am-node13   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-2                    1/1     Running     0                4h26m   10.104.1.120    4am-node10   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-bookie-init-w458g           0/1     Completed   0                4h26m   10.104.5.106    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-broker-0                    1/1     Running     0                4h26m   10.104.5.110    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-proxy-0                     1/1     Running     0                4h26m   10.104.5.109    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-pulsar-init-ckshc           0/1     Completed   0                4h26m   10.104.5.107    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-recovery-0                  1/1     Running     0                4h26m   10.104.5.105    4am-node12   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-0                 1/1     Running     0                4h26m   10.104.9.66     4am-node14   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-1                 1/1     Running     0                4h24m   10.104.1.128    4am-node10   <none>           <none>
perf-cluster-ma57000-2-14-5118-pulsar-zookeeper-2                 1/1     Running     0                4h22m   10.104.5.156    4am-node12   <none>           <none> 

client:

[2023-03-29 03:04:22,463 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)>, <Time:{'RPC start': '2023-03-29 03:03:22.461514', 'RPC error': '2023-03-29 03:04:22.463570'}> (decorators.py:108)
[2023-03-29 03:04:22,465 - ERROR - fouram]: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 475, in _execute_search_requests
    response = self._stub.Search(request, timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 58, in handler
    raise MilvusException(message=str(e)) from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)>
 (api_request.py:48)
[2023-03-29 03:04:22,465 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)> (api_request.py:49)
[2023-03-29 03:04:22,465 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1680059062.462659056","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>)> (func_check.py:43)

Expected Behavior

No response

Steps To Reproduce

1. create a collection
2. build HNSW index on the vector column
3. insert 1M vectors
4. flush the collection
5. build the index on the vector column with the same parameters
6. count the total number of rows
7. load the collection
8. concurrent search (see the sketch below)
9. clean all collections, or not
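
For reference, a minimal pymilvus sketch of steps 1-7. This is a sketch only: the host, collection name, and batch size are illustrative assumptions, not values taken from the test harness.

```python
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="127.0.0.1", port="19530")     # assumed local endpoint

dim = 768
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("gist1m_repro", schema)          # 1. create a collection

index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
collection.create_index("vector", index_params)          # 2. build HNSW index on the vector column

for _ in range(1000):                                    # 3. insert 1M vectors, ni_per=1000 per batch
    collection.insert([np.random.random((1000, dim)).tolist()])
collection.flush()                                       # 4. flush the collection
collection.create_index("vector", index_params)          # 5. build the index again with the same parameters
print(collection.num_entities)                           # 6. count the total number of rows
collection.load()                                        # 7. load the collection
```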

Milvus Log

No response

Anything else?

 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'dim': 768,
                                                    'dataset_name': 'gist',
                                                    'dataset_size': '1m',
                                                    'ni_per': 1000,
                                                    'metric_type': 'L2'},
                                 'collection_params': {'other_fields': []},
                                 'load_params': {},
                                 'search_params': {},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 1,
                                                       'during_time': 3600,
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'search_param': {'ef': 16},
                                                                  'random_data': True}}]},
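
A single-worker sketch of the search task described by the concurrent_params/concurrent_tasks above. It approximates the fouram/locust harness rather than reproducing it; the 60 s timeout is an assumption inferred from the ~60 s gap between 'RPC start' and 'RPC error' in the log, and the collection name is the hypothetical one from the earlier sketch.

```python
import random
import time

from pymilvus import Collection, connections
from pymilvus.exceptions import MilvusException

connections.connect(host="127.0.0.1", port="19530")   # assumed endpoint
collection = Collection("gist1m_repro")               # hypothetical name from the sketch above
dim, during_time = 768, 3600                          # dataset dim / test duration in seconds

def one_search():
    vec = [[random.random() for _ in range(dim)]]     # nq=1, random_data=True
    return collection.search(data=vec, anns_field="vector",
                             param={"metric_type": "L2", "params": {"ef": 16}},
                             limit=1,                 # top_k=1
                             timeout=60)              # assumed client-side deadline

start = time.time()
while time.time() - start < during_time:              # concurrent_number=1, so a single loop
    try:
        one_search()
    except MilvusException as exc:                    # DEADLINE_EXCEEDED surfaces as MilvusException
        print("search failed:", exc)
```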

jingkl avatar Mar 29 '23 07:03 jingkl

/assign @jiaoew1991 /unassign

yanliang567 avatar Mar 30 '23 02:03 yanliang567

/assign @XuanYang-cn /unassign

jiaoew1991 avatar Apr 04 '23 02:04 jiaoew1991

release_name_prefix: perf-cluster-master-1680652800 deploy_config: fouramf-server-cluster-8c16m case_params: fouramf-client-gist1m-concurrent-hnsw other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster case_name: test_concurrent_locust_custom_parameters

server:

perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-2                 1/1     Running            0                 4h17m   10.104.5.51     4am-node12   <none>           <none> (cli_client.py:130)
perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-1                 1/1     Running            0                 4h18m   10.104.9.129    4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-zookeeper-0                 1/1     Running            0                 4h21m   10.104.1.184    4am-node10   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-recovery-0                  1/1     Running            0                 4h21m   10.104.4.51     4am-node11   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-pulsar-init-cvsdv           0/1     Completed          0                 4h21m   10.104.9.65     4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-proxy-0                     1/1     Running            0                 4h21m   10.104.9.66     4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-broker-0                    1/1     Running            0                 4h21m   10.104.6.228    4am-node13   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-init-52rgc           0/1     Completed          0                 4h21m   10.104.9.64     4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-2                    1/1     Running            0                 4h21m   10.104.6.11     4am-node13   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-1                    1/1     Running            0                 4h21m   10.104.1.190    4am-node10   <none>           <none>
perf-cluster-ma52800-2-96-4095-pulsar-bookie-0                    1/1     Running            0                 4h21m   10.104.5.36     4am-node12   <none>           <none>
perf-cluster-ma52800-2-96-4095-minio-3                            1/1     Running            0                 4h21m   10.104.1.192    4am-node10   <none>           <none>
perf-cluster-ma52800-2-96-4095-minio-2                            1/1     Running            0                 4h21m   10.104.4.82     4am-node11   <none>           <none>
perf-cluster-ma52800-2-96-4095-minio-1                            1/1     Running            0                 4h21m   10.104.9.122    4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-minio-0                            1/1     Running            0                 4h21m   10.104.5.38     4am-node12   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-rootcoord-5f857cf9c9x9v8h   1/1     Running            1 (4h17m ago)     4h21m   10.104.12.77    4am-node17   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-querynode-88659d889-8lzqq   1/1     Running            36 (7m31s ago)    4h21m   10.104.13.70    4am-node16   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-querycoord-5579f65cf2jn4j   1/1     Running            1 (4h17m ago)     4h21m   10.104.12.74    4am-node17   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-proxy-76bb4f9bb-5v42b       1/1     Running            1 (4h17m ago)     4h21m   10.104.5.20     4am-node12   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-indexnode-5584fb8695fzgwp   1/1     Running            1 (4h17m ago)     4h21m   10.104.14.202   4am-node18   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-indexcoord-77b64c69bknl5b   1/1     Running            0                 4h21m   10.104.12.76    4am-node17   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-datanode-7556567d68-bs29t   1/1     Running            1 (4h17m ago)     4h21m   10.104.12.75    4am-node17   <none>           <none>
perf-cluster-ma52800-2-96-4095-milvus-datacoord-b4877ddf9-r8pbn   1/1     Running            1 (4h17m ago)     4h21m   10.104.12.73    4am-node17   <none>           <none>
perf-cluster-ma52800-2-96-4095-etcd-2                             1/1     Running            0                 4h21m   10.104.9.123    4am-node14   <none>           <none>
perf-cluster-ma52800-2-96-4095-etcd-1                             1/1     Running            0                 4h21m   10.104.4.81     4am-node11   <none>           <none>
perf-cluster-ma52800-2-96-4095-etcd-0                             1/1     Running            0                 4h21m   10.104.6.8      4am-node13   <none>           <none>

client:

[2023-04-05 04:21:40,562 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)> (func_check.py:43)
[2023-04-05 04:21:40,562 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)> (api_request.py:49)
(api_request.py:48)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_1_440573140266123379v1 is not available in any replica, err=LackSegment(segmentID=440573140266324657): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440573140266123379v0 is not available in any replica, err=LackSegment(segmentID=440573140266324405): context done during sleep after run#6: context deadline exceeded)>

jingkl avatar Apr 07 '23 03:04 jingkl

QueryNode OOMed and cannot recover properly

Using 16G of memory to load 6G of data does not seem to make sense.

XuanYang-cn avatar Apr 07 '23 06:04 XuanYang-cn

QueryNode OOMed and cannot recover properly

Using 16G of memory to load 6G of data does not seem to make sense.

Why not? HNSW should take at most 2 times the original data size.

xiaofan-luan avatar Apr 07 '23 16:04 xiaofan-luan

Why not? HNSW should take at most 2 times the original data size.

The 6G figure is calculated by the HNSW index sizing tool, which already doubles the original data size.

XuanYang-cn avatar Apr 10 '23 02:04 XuanYang-cn

Reproduced locally: loading a total of 5.9G of data took about 18G of memory at peak.

XuanYang-cn avatar Apr 11 '23 05:04 XuanYang-cn

@jingkl According to @yah01, we currently need 3 times the data size in memory to load indexes, so 6G of data might need 18G of memory in total.

There're ongoing PRs to fix this: #22984, #23138
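
As a rough back-of-the-envelope check of those numbers (assuming gist1m here means 1M float32 vectors of dim 768, as in the test config; the multipliers are the estimates quoted above, not measured values):

```python
# Rough estimate only; real usage depends on index build/load implementation details.
n, dim, bytes_per_float = 1_000_000, 768, 4
raw = n * dim * bytes_per_float / 2**30   # ~2.9 GiB of raw vectors
index = 2 * raw                           # sizing-tool estimate: HNSW ~2x raw data -> ~5.7 GiB
load_peak = 3 * index                     # loading currently needs ~3x the index size -> ~17 GiB
print(f"raw={raw:.1f} GiB  index~{index:.1f} GiB  load peak~{load_peak:.1f} GiB")
```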

XuanYang-cn avatar Apr 11 '23 06:04 XuanYang-cn

Just added the config queryNode.ioPoolSize. The default size is cpuNum * 8, clamped to at least 32 and at most 256. A lower size would consume less memory.
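
For example, the clamping rule works like this (an illustration of the sizing described above, not the actual Milvus code; cpu_count is just a stand-in for the node's CPU quota):

```python
import os

# queryNode.ioPoolSize default as described: cpuNum * 8, clamped to [32, 256].
def default_io_pool_size(cpu_num: int = os.cpu_count() or 1) -> int:
    return min(max(cpu_num * 8, 32), 256)

print(default_io_pool_size())  # e.g. 8 CPUs -> 64; set queryNode.ioPoolSize lower to save memory
```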

yah01 avatar Apr 13 '23 06:04 yah01

xi-ge16 is working on reducing the memory usage of arrow. That is the main reason why it's slow

xiaofan-luan avatar Apr 13 '23 18:04 xiaofan-luan

xi-ge16 is working on reducing the memory usage of arrow. That is the main reason why it consumes so much memory

xiaofan-luan avatar Apr 13 '23 18:04 xiaofan-luan

/unassign /assign @jingkl

XuanYang-cn avatar Apr 17 '23 02:04 XuanYang-cn

Verified with release_name_prefix: perf-standalone-1683721800, deploy_config: fouramf-server-standalone-8c16m, case_params: fouramf-client-gist1m-concurrent-hnsw, other_params: -s --client_version=2.4.0 --update_helm_file master-20230510-0c99399f. This issue no longer appears, closing the issue.

jingkl avatar May 11 '23 08:05 jingkl