[Bug]: [perf][cluster] After inserting 1M vectors and building an HNSW index, concurrent search fails with "fail to search on all shard leaders"

Open jingkl opened this issue 1 year ago • 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:master-20230401-3b9716bb
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):2.3.0.dev45
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

release_name_prefix: perf-cluster-master-1680393600
deploy_config: fouramf-server-cluster-8c16m
case_params: fouramf-client-sift1m-concurrent-hnsw
other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster
case_name: test_concurrent_locust_custom_parameters

perf-cluster-ma93600-1-27-5970-etcd-0                             1/1     Running            0                 4h14m   10.104.9.60     4am-node14   <none>           <none>
perf-cluster-ma93600-1-27-5970-etcd-1                             1/1     Running            0                 4h14m   10.104.4.111    4am-node11   <none>           <none>
perf-cluster-ma93600-1-27-5970-etcd-2                             1/1     Running            0                 4h14m   10.104.5.101    4am-node12   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-datacoord-6659b5dcd67kglm   1/1     Running            2 (4h6m ago)      4h14m   10.104.14.172   4am-node18   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-datanode-5748cf4ddb-7xp8w   1/1     Running            2 (4h6m ago)      4h14m   10.104.14.173   4am-node18   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-indexcoord-7d44b5b4c5zgbc   1/1     Running            0                 4h14m   10.104.14.171   4am-node18   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-indexnode-56d4965bf55tlg5   1/1     Running            1 (4h10m ago)     4h14m   10.104.12.233   4am-node17   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-proxy-849db6f44b-4b4k2      1/1     Running            3 (4h3m ago)      4h14m   10.104.12.234   4am-node17   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-querycoord-745b47f7b4szmj   1/1     Running            3 (4h3m ago)      4h14m   10.104.12.235   4am-node17   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-querynode-68bc4f58656fgp6   1/1     Running            2 (3h58m ago)     4h14m   10.104.13.210   4am-node16   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-rootcoord-96bb67675-wwtr5   1/1     Running            2 (4h6m ago)      4h14m   10.104.12.232   4am-node17   <none>           <none>
perf-cluster-ma93600-1-27-5970-minio-0                            1/1     Running            0                 4h14m   10.104.4.110    4am-node11   <none>           <none>
perf-cluster-ma93600-1-27-5970-minio-1                            1/1     Running            0                 4h14m   10.104.6.241    4am-node13   <none>           <none>
perf-cluster-ma93600-1-27-5970-minio-2                            1/1     Running            0                 4h14m   10.104.5.102    4am-node12   <none>           <none>
perf-cluster-ma93600-1-27-5970-minio-3                            1/1     Running            0                 4h14m   10.104.1.49     4am-node10   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-bookie-0                    1/1     Running            0                 4h14m   10.104.5.110    4am-node12   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-bookie-1                    1/1     Running            0                 4h14m   10.104.6.2      4am-node13   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-bookie-2                    1/1     Running            0                 4h14m   10.104.1.70     4am-node10   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-bookie-init-9kz4j           0/1     Completed          0                 4h14m   10.104.1.27     4am-node10   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-broker-0                    1/1     Running            0                 4h14m   10.104.6.5      4am-node13   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-proxy-0                     1/1     Running            0                 4h14m   10.104.5.108    4am-node12   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-pulsar-init-m5wh2           0/1     Completed          0                 4h14m   10.104.5.76     4am-node12   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-recovery-0                  1/1     Running            0                 4h14m   10.104.6.214    4am-node13   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-zookeeper-0                 1/1     Running            0                 4h14m   10.104.4.107    4am-node11   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-zookeeper-1                 1/1     Running            0                 4h11m   10.104.9.73     4am-node14   <none>           <none>
perf-cluster-ma93600-1-27-5970-pulsar-zookeeper-2                 1/1     Running            0                 4h6m    10.104.1.74     4am-node10   <none>           <none>
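
The RESTARTS column in the listing above is the quickest way to see which components went down during the run. A minimal sketch for pulling the restarted pods out of such a listing, assuming the column layout shown above (`restarted_pods` is a hypothetical helper, not part of any Milvus tooling):

```python
import re
from typing import Dict


def restarted_pods(listing: str) -> Dict[str, int]:
    """Map pod name -> restart count for pods that restarted at least once."""
    result = {}
    for line in listing.splitlines():
        # Expected columns: NAME  READY  STATUS  RESTARTS [(<age> ago)]  AGE ...
        m = re.match(r"(\S+)\s+\d+/\d+\s+\S+\s+(\d+)", line.strip())
        if m and int(m.group(2)) > 0:
            result[m.group(1)] = int(m.group(2))
    return result


# Two rows copied from the listing above.
sample = """\
perf-cluster-ma93600-1-27-5970-milvus-indexcoord-7d44b5b4c5zgbc   1/1     Running            0                 4h14m   10.104.14.171   4am-node18   <none>           <none>
perf-cluster-ma93600-1-27-5970-milvus-querynode-68bc4f58656fgp6   1/1     Running            2 (3h58m ago)     4h14m   10.104.13.210   4am-node16   <none>           <none>
"""

for name, count in restarted_pods(sample).items():
    print(name, count)
```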

querynode:

Screenshot 2023-04-03 16 27 38

client log:

[2023-04-02 00:17:07,100 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)>, <Time:{'RPC start': '2023-04-02 00:17:07.097074', 'RPC error': '2023-04-02 00:17:07.099947'}> (decorators.py:108)
[2023-04-02 00:17:07,100 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)>
 (api_request.py:48)
[2023-04-02 00:17:07,100 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)> (api_request.py:49)
[2023-04-02 00:17:07,100 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)> (func_check.py:43)
[2023-04-02 00:17:07,103 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)>, <Time:{'RPC start': '2023-04-02 00:17:07.101270', 'RPC error': '2023-04-02 00:17:07.103818'}> (decorators.py:108)
[2023-04-02 00:17:07,104 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=target node id not match target id = 5, node id = 11)>
 (api_request.py:48)
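
Every failure in the log above carries the same pair of IDs in its reason string. A small sketch for extracting that pair when scanning a long client log, assuming the message format stays as shown (`parse_node_mismatch` is a hypothetical helper name):

```python
import re

# Matches the reason string that appears in the client log of this issue.
MISMATCH = re.compile(
    r"target node id not match target id = (\d+), node id = (\d+)"
)


def parse_node_mismatch(message):
    """Return (target_id, node_id) if the message reports a mismatch, else None."""
    m = MISMATCH.search(message)
    return (int(m.group(1)), int(m.group(2))) if m else None


line = (
    "<MilvusException: (code=1, message=fail to search on all shard leaders, "
    "err=fail to Search, QueryNode ID=5, reason=target node id not match "
    "target id = 5, node id = 11)>"
)
print(parse_node_mismatch(line))
```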

Expected Behavior

No response

Steps To Reproduce

1. Create a collection.
2. Build an HNSW index on the vector column.
3. Insert 1M vectors.
4. Flush the collection.
5. Build the index on the vector column again with the same parameters.
6. Count the total number of rows.
7. Load the collection.
8. Perform concurrent operations (search).
9. Optionally clean up all collections.
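
The steps above can be sketched with the pymilvus ORM API roughly as follows. This is an approximation, not the fouram test itself: the collection name, field names, host, and port are hypothetical, random vectors stand in for the SIFT dataset, and a plain loop stands in for the locust-driven load (the reported run used `concurrent_number: 1`). The index and search parameters are taken from the test configuration in this issue.

```python
import random
import time

# Values taken from the test configuration reported in this issue.
DIM = 128
TOTAL_ROWS = 1_000_000
BATCH_SIZE = 50_000  # 'ni_per'
INDEX_PARAMS = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},
}
SEARCH_PARAMS = {"metric_type": "L2", "params": {"ef": 16}}


def random_vectors(n, dim=DIM):
    """Random stand-in data; the original test uses the SIFT dataset."""
    return [[random.random() for _ in range(dim)] for _ in range(n)]


def reproduce(host="localhost", port="19530", duration_s=3600):
    # Imported here so the constants above can be inspected without pymilvus.
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(alias="default", host=host, port=port)

    # 1. Create a collection (names are hypothetical).
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    collection = Collection("repro_hnsw_sift1m", schema)

    # 2. Build the HNSW index on the vector column.
    collection.create_index("float_vector", INDEX_PARAMS)

    # 3. Insert the vectors in batches.
    for start in range(0, TOTAL_ROWS, BATCH_SIZE):
        ids = list(range(start, start + BATCH_SIZE))
        collection.insert([ids, random_vectors(BATCH_SIZE)])

    # 4. Flush, then 5. build the index again with the same parameters.
    collection.flush()
    collection.create_index("float_vector", INDEX_PARAMS)

    # 6. Count rows, 7. load.
    print("rows:", collection.num_entities)
    collection.load()

    # 8. Search repeatedly (nq=1, top_k=1) until the deadline.
    deadline = time.time() + duration_s
    while time.time() < deadline:
        collection.search(data=random_vectors(1), anns_field="float_vector",
                          param=SEARCH_PARAMS, limit=1)
```

Calling `reproduce()` requires a reachable Milvus cluster; the search loop is where the "fail to search on all shard leaders" error surfaced in the reported runs.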

Milvus Log

No response

Anything else?

'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 1000000,
                                                    'ni_per': 50000,
                                                    'metric_type': 'L2'},
                                 'collection_params': {'other_fields': []},
                                 'load_params': {},
                                 'search_params': {},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 1,
                                                       'during_time': 3600,
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'search_param': {'ef': 16},
                                                                  'random_data': True}}]},

jingkl • Apr 03 '23 08:04

/assign

smellthemoon • Apr 03 '23 08:04

/unassign @yanliang567

smellthemoon • Apr 03 '23 08:04

The querynode restarted, which caused the series of "target node id not match" errors.
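
A toy model of why a restart would produce this error, assuming (as the error text suggests) that the querynode compares the target ID in the request with its own current ID, and that a restarted node comes back under a new ID while the caller still holds the old one. The class and names below are illustrative only, not Milvus code:

```python
class NodeIDMismatch(Exception):
    """Raised when a request targets an ID the node no longer has."""


class ToyQueryNode:
    def __init__(self, node_id):
        self.node_id = node_id

    def search(self, target_id):
        # Reject requests addressed to a different (stale) node ID.
        if target_id != self.node_id:
            raise NodeIDMismatch(
                f"target node id not match target id = {target_id}, "
                f"node id = {self.node_id}"
            )
        return "ok"


cached_leader_id = 5                  # ID the caller remembers
node = ToyQueryNode(node_id=5)
assert node.search(cached_leader_id) == "ok"

node = ToyQueryNode(node_id=11)       # same pod after a restart, new ID
try:
    node.search(cached_leader_id)
except NodeIDMismatch as exc:
    error_text = str(exc)
    print(error_text)
```

In this model the error persists until the caller refreshes its cached ID, which matches the repeated failures seen in the client log.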

smellthemoon • Apr 04 '23 02:04

release_name_prefix: perf-cluster-master-1680566400
deploy_config: fouramf-server-cluster-8c16m
case_params: fouramf-client-sift1m-concurrent-hnsw
other_params: --milvus_tag_prefix=master -s --deploy_mode=cluster
case_name: test_concurrent_locust_custom_parameters

server:

perf-cluster-ma66400-1-71-9223-etcd-0                             1/1     Running            0                 4h16m   10.104.4.148    4am-node11   <none>           <none>
perf-cluster-ma66400-1-71-9223-etcd-1                             1/1     Running            0                 4h16m   10.104.1.44     4am-node10   <none>           <none>
perf-cluster-ma66400-1-71-9223-etcd-2                             1/1     Running            0                 4h16m   10.104.5.223    4am-node12   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-datacoord-85998c68f6jnbjl   1/1     Running            3 (4h4m ago)      4h16m   10.104.14.121   4am-node18   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-datanode-5cb67d48c8-xffwv   1/1     Running            3 (4h6m ago)      4h16m   10.104.13.229   4am-node16   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-indexcoord-f496cb49dfwh2n   1/1     Running            0                 4h16m   10.104.14.120   4am-node18   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-indexnode-6858478959pqf5j   1/1     Running            0                 4h16m   10.104.14.122   4am-node18   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-proxy-6fbd6645f4-xcndx      1/1     Running            3 (4h5m ago)      4h16m   10.104.12.201   4am-node17   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-querycoord-8677fc54dqt2rs   1/1     Running            3 (4h5m ago)      4h16m   10.104.14.119   4am-node18   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-querynode-876d8b94f-487v7   1/1     Running            2 (3h55m ago)     4h16m   10.104.12.202   4am-node17   <none>           <none>
perf-cluster-ma66400-1-71-9223-milvus-rootcoord-bd46dd987-m6x4g   1/1     Running            3 (4h5m ago)      4h16m   10.104.14.118   4am-node18   <none>           <none>
perf-cluster-ma66400-1-71-9223-minio-0                            1/1     Running            0                 4h16m   10.104.4.151    4am-node11   <none>           <none>
perf-cluster-ma66400-1-71-9223-minio-1                            1/1     Running            0                 4h16m   10.104.9.213    4am-node14   <none>           <none>
perf-cluster-ma66400-1-71-9223-minio-2                            1/1     Running            0                 4h16m   10.104.5.225    4am-node12   <none>           <none>
perf-cluster-ma66400-1-71-9223-minio-3                            1/1     Running            0                 4h16m   10.104.1.46     4am-node10   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-bookie-0                    1/1     Running            0                 4h16m   10.104.5.3      4am-node12   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-bookie-1                    1/1     Running            0                 4h16m   10.104.6.49     4am-node13   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-bookie-2                    1/1     Running            0                 4h16m   10.104.4.178    4am-node11   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-bookie-init-5rp69           0/1     Completed          0                 4h16m   10.104.1.23     4am-node10   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-broker-0                    1/1     Running            0                 4h16m   10.104.1.64     4am-node10   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-proxy-0                     1/1     Running            0                 4h16m   10.104.5.236    4am-node12   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-pulsar-init-4pqbw           0/1     Completed          0                 4h16m   10.104.9.191    4am-node14   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-recovery-0                  1/1     Running            0                 4h16m   10.104.4.174    4am-node11   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-zookeeper-0                 1/1     Running            0                 4h16m   10.104.4.149    4am-node11   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-zookeeper-1                 1/1     Running            0                 4h14m   10.104.5.242    4am-node12   <none>           <none>
perf-cluster-ma66400-1-71-9223-pulsar-zookeeper-2                 1/1     Running            0                 4h6m    10.104.1.92     4am-node10   <none>           <none>

client log:

[2023-04-04 00:19:10,991 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=2, reason=target node id not match target id = 2, node id = 14)>, <Time:{'RPC start': '2023-04-04 00:19:04.369938', 'RPC error': '2023-04-04 00:19:10.990938'}> (decorators.py:108)
[2023-04-04 00:19:10,992 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=2, reason=target node id not match target id = 2, node id = 14)>
 (api_request.py:48)

jingkl • Apr 06 '23 10:04

A querynode panic caused the querynode to restart. This is related to #23338, and the PR has been merged. Please help verify, @jingkl.

smellthemoon • Apr 14 '23 08:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] • May 15 '23 14:05