[Bug]: [benchmark] querynode memory allocation imbalance, resulting in load failure

Open elstic opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230424-e23af640
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):  kafka  
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

With an HNSW index and 100m inserted vectors, loading the collection fails and reports: "deny to load, insufficient memory, please allocate more resources".

argo task: fouramf-lwq89
case: test_concurrent_locust_100m_hnsw_ddl_dql_filter_kafka_cluster
The memory allocation for this case was set in advance: the 2 querynodes were given a memory limit of 64g, which should be enough for 100m data.
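
As a rough sanity check on the sizing claim above, the footprint of an HNSW index can be approximated from the vector count, the dimension, and M. This is a back-of-the-envelope sketch, not Milvus's own memory estimator. The dimension of 128 is an assumption (the issue does not state it), and counting about `2 * M` 4-byte neighbor ids per vector is a common approximation that ignores upper graph layers and allocator overhead.

```python
def estimate_hnsw_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Approximate in-memory size of an HNSW index over float32 vectors."""
    vector_bytes = dim * 4    # float32 components of one vector
    link_bytes = 2 * m * 4    # roughly 2*M int32 neighbor ids at the base layer
    return num_vectors * (vector_bytes + link_bytes)


# dim=128 is an ASSUMED value; M=8 is taken from the index params in the client log.
total = estimate_hnsw_bytes(num_vectors=100_000_000, dim=128, m=8)
print(f"estimated index size: {total / 2**30:.1f} GiB")
```

Under these assumptions the estimate can be compared with the 64g limit mentioned above. If segments are not spread evenly, a single querynode can reach its limit even when the combined memory would be sufficient.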

server

fouramf-lwq89-90-4387-etcd-0                                      1/1     Running            0               4m23s   10.104.20.15    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-etcd-1                                      1/1     Running            0               4m23s   10.104.18.201   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-etcd-2                                      1/1     Running            0               4m23s   10.104.22.128   4am-node26   <none>           <none>
fouramf-lwq89-90-4387-kafka-0                                     1/1     Running            0               4m23s   10.104.20.17    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-kafka-1                                     1/1     Running            3 (2m59s ago)   4m23s   10.104.5.167    4am-node12   <none>           <none>
fouramf-lwq89-90-4387-kafka-2                                     1/1     Running            3 (3m12s ago)   4m23s   10.104.9.140    4am-node14   <none>           <none>
fouramf-lwq89-90-4387-milvus-datacoord-56f6698fb8-r2cbf           1/1     Running            0               4m23s   10.104.16.22    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-milvus-datanode-888977d79-rzmxh             1/1     Running            0               4m23s   10.104.15.226   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-indexcoord-778b7df58b-x4nml          1/1     Running            0               4m23s   10.104.15.225   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-indexnode-6b7c66d545-nfx7s           1/1     Running            0               4m23s   10.104.15.224   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-proxy-5d75f8c7cb-hr29w               1/1     Running            0               4m23s   10.104.20.10    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-milvus-querycoord-7bf7f7999c-878xq          1/1     Running            0               4m23s   10.104.15.227   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-glhnj           1/1     Running            0               4m23s   10.104.20.11    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-kvft6           1/1     Running            0               4m23s   10.104.16.23    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-milvus-rootcoord-6db8f647ff-sm8rw           1/1     Running            0               4m23s   10.104.15.223   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-minio-0                                     1/1     Running            0               4m23s   10.104.16.25    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-minio-1                                     1/1     Running            0               4m23s   10.104.20.19    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-minio-2                                     1/1     Running            0               4m23s   10.104.18.203   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-minio-3                                     1/1     Running            0               4m23s   10.104.19.139   4am-node28   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-0                                 1/1     Running            0               4m23s   10.104.20.16    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-1                                 1/1     Running            0               4m23s   10.104.18.200   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-2                                 1/1     Running            0               4m23s   10.104.22.122   4am-node26   <none>           <none>

server (querynode restart)

fouramf-lwq89-90-4387-etcd-0                                      1/1     Running     0               161m    10.104.20.15    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-etcd-1                                      1/1     Running     0               161m    10.104.18.201   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-etcd-2                                      1/1     Running     0               161m    10.104.22.128   4am-node26   <none>           <none>
fouramf-lwq89-90-4387-kafka-0                                     1/1     Running     0               161m    10.104.20.17    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-kafka-1                                     1/1     Running     3 (159m ago)    161m    10.104.5.167    4am-node12   <none>           <none>
fouramf-lwq89-90-4387-kafka-2                                     1/1     Running     3 (159m ago)    161m    10.104.9.140    4am-node14   <none>           <none>
fouramf-lwq89-90-4387-milvus-datacoord-56f6698fb8-r2cbf           1/1     Running     0               161m    10.104.16.22    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-milvus-datanode-888977d79-rzmxh             1/1     Running     0               161m    10.104.15.226   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-indexcoord-778b7df58b-x4nml          1/1     Running     0               161m    10.104.15.225   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-indexnode-6b7c66d545-nfx7s           1/1     Running     0               161m    10.104.15.224   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-proxy-5d75f8c7cb-hr29w               1/1     Running     0               161m    10.104.20.10    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-milvus-querycoord-7bf7f7999c-878xq          1/1     Running     0               161m    10.104.15.227   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-glhnj           1/1     Running     1 (11m ago)     161m    10.104.20.11    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-kvft6           1/1     Running     1 (10m ago)     161m    10.104.16.23    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-milvus-rootcoord-6db8f647ff-sm8rw           1/1     Running     0               161m    10.104.15.223   4am-node20   <none>           <none>
fouramf-lwq89-90-4387-minio-0                                     1/1     Running     0               161m    10.104.16.25    4am-node21   <none>           <none>
fouramf-lwq89-90-4387-minio-1                                     1/1     Running     0               161m    10.104.20.19    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-minio-2                                     1/1     Running     0               161m    10.104.18.203   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-minio-3                                     1/1     Running     0               161m    10.104.19.139   4am-node28   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-0                                 1/1     Running     0               161m    10.104.20.16    4am-node22   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-1                                 1/1     Running     0               161m    10.104.18.200   4am-node25   <none>           <none>
fouramf-lwq89-90-4387-zookeeper-2                                 1/1     Running     0               161m    10.104.22.122   4am-node26   <none>           <none> 
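
The two listings differ mainly in the RESTARTS column of the querynode pods. A small helper like the one below can pull the restarted pods out of such a listing. It is a sketch that assumes the text is the output of `kubectl get pods -o wide` without the header row (the issue does not say how the listing was produced); the function name and regular expression are illustrative.

```python
import re
from typing import Dict

# NAME  READY  STATUS  RESTARTS [(last restart)]  AGE ...
POD_LINE = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*)\))?\s+(?P<age>\S+)"
)


def restarted_pods(listing: str) -> Dict[str, int]:
    """Return {pod name: restart count} for pods with at least one restart."""
    result = {}
    for line in listing.splitlines():
        match = POD_LINE.match(line.strip())
        if match and int(match["restarts"]) > 0:
            result[match["name"]] = int(match["restarts"])
    return result


# Three rows from the second listing above, with column spacing abridged.
SAMPLE = """\
fouramf-lwq89-90-4387-etcd-0   1/1   Running   0   161m   10.104.20.15   4am-node22   <none>   <none>
fouramf-lwq89-90-4387-kafka-1   1/1   Running   3 (159m ago)   161m   10.104.5.167   4am-node12   <none>   <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-glhnj   1/1   Running   1 (11m ago)   161m   10.104.20.11   4am-node22   <none>   <none>
"""
print(restarted_pods(SAMPLE))
```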

client:

[2023-04-24 14:15:08,894 -  INFO - fouram]: [CommonCases] RT of build index HNSW: 4333.0958s (common_cases.py:87)
[2023-04-24 14:15:08,898 -  INFO - fouram]: [Base] Params of index: {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 8, 'efConstruction': 200}} (base.py:297)
[2023-04-24 14:15:08,898 -  INFO - fouram]: [CommonCases] Prepare index HNSW done. (common_cases.py:90)
[2023-04-24 14:15:08,898 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:95)
[2023-04-24 14:15:08,898 -  INFO - fouram]: [Base] Start load collection fouram_2qUsR2gY,replica_number:1,kwargs:{} (base.py:138)
[2023-04-24 14:27:26,018 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:27:26.016189', 'RPC error': '2023-04-24 14:27:26.018055'}> (decorators.py:108)
[2023-04-24 14:27:26,020 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:15:08.927685', 'RPC error': '2023-04-24 14:27:26.020265'}> (decorators.py:108)
[2023-04-24 14:27:26,020 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:15:08.898503', 'RPC error': '2023-04-24 14:27:26.020411'}> (decorators.py:108)
[2023-04-24 14:27:26,021 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load
    conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
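
The `Time` fields in the error lines record when each RPC started and when it failed, so they show how long the load ran before it was denied. A minimal sketch using only the standard library, with the two timestamps copied from the `load_collection` error line above (the helper name is illustrative):

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def rpc_elapsed_seconds(start: str, error: str) -> float:
    """Seconds between the 'RPC start' and 'RPC error' timestamps."""
    started = datetime.strptime(start, TIME_FORMAT)
    failed = datetime.strptime(error, TIME_FORMAT)
    return (failed - started).total_seconds()


elapsed = rpc_elapsed_seconds(
    "2023-04-24 14:15:08.898503",  # 'RPC start' of load_collection
    "2023-04-24 14:27:26.020411",  # 'RPC error' of load_collection
)
print(f"load_collection failed after {elapsed:.1f} s")
```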

memory usage: image

Expected Behavior

Memory usage is balanced across the 2 querynodes and the load succeeds.

Steps To Reproduce

1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns or not
7. count the total number of rows
8. load collection  ===> failed
9. perform concurrent operations
10. clean all collections or not
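
Steps 1 to 8 above can be sketched with the pymilvus ORM API roughly as follows. This is not the fouram test case itself: the endpoint, collection name, field names, dimension, and the scaled-down row count are all assumptions made for illustration, and only the index parameters come from the client log. pymilvus is imported inside the function because it needs a reachable Milvus cluster to do anything useful.

```python
import random

# Index parameters copied from the client log in this issue.
HNSW_INDEX_PARAMS = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},
}


def random_vectors(count, dim):
    """Generate `count` random float vectors of length `dim`."""
    return [[random.random() for _ in range(dim)] for _ in range(count)]


def reproduce(name="repro_hnsw_load", dim=128, total=100_000, batch=10_000):
    """Run steps 1-8; all default values are assumed, not taken from the issue."""
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(host="localhost", port="19530")   # assumed endpoint
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
    ])
    collection = Collection(name, schema)                        # step 1
    collection.create_index("float_vector", HNSW_INDEX_PARAMS)   # step 2
    for _ in range(total // batch):                              # step 3
        collection.insert([random_vectors(batch, dim)])
    collection.flush()                                           # step 4
    collection.create_index("float_vector", HNSW_INDEX_PARAMS)   # step 5
    print("rows:", collection.num_entities)                      # step 7
    collection.load(replica_number=1)                            # step 8
```

The reported failure needs 100m rows and two querynodes, so a run at the default size here is only a smoke test of the call sequence.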

Milvus Log

No response

Anything else?

No response

elstic avatar Apr 25 '23 03:04 elstic

/assign @congqixia /unassign

yanliang567 avatar Apr 25 '23 06:04 yanliang567

@congqixia I retried and the issue still persists.
Image used: master-20230505-38f45c37
argo task: fouramf-gjwdb

server:

fouramf-gjwdb-59-4613-etcd-0                                      1/1     Running     0               3m54s   10.104.6.245    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-etcd-1                                      1/1     Running     0               3m53s   10.104.22.40    4am-node26   <none>           <none>
fouramf-gjwdb-59-4613-etcd-2                                      1/1     Running     0               3m53s   10.104.23.122   4am-node27   <none>           <none>
fouramf-gjwdb-59-4613-kafka-0                                     1/1     Running     2 (97s ago)     3m54s   10.104.6.244    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-kafka-1                                     1/1     Running     1 (110s ago)    3m53s   10.104.4.171    4am-node11   <none>           <none>
fouramf-gjwdb-59-4613-kafka-2                                     1/1     Running     1 (106s ago)    3m53s   10.104.16.126   4am-node21   <none>           <none>
fouramf-gjwdb-59-4613-milvus-datacoord-6ddf748f54-x5jjh           1/1     Running     0               3m54s   10.104.6.223    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-milvus-datanode-744c658cdf-mtxnr            1/1     Running     0               3m53s   10.104.6.222    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-milvus-indexcoord-656bdb998d-qmtls          1/1     Running     0               3m54s   10.104.22.35    4am-node26   <none>           <none>
fouramf-gjwdb-59-4613-milvus-indexnode-85877fc7f-hprgb            1/1     Running     0               3m54s   10.104.1.82     4am-node10   <none>           <none>
fouramf-gjwdb-59-4613-milvus-proxy-94567d57d-tlknd                1/1     Running     0               3m53s   10.104.6.221    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-milvus-querycoord-55dfd559bc-q7nth          1/1     Running     0               3m53s   10.104.22.34    4am-node26   <none>           <none>
fouramf-gjwdb-59-4613-milvus-querynode-cddc59-fnndr               1/1     Running     0               3m54s   10.104.9.155    4am-node14   <none>           <none>
fouramf-gjwdb-59-4613-milvus-querynode-cddc59-wllqq               1/1     Running     0               3m54s   10.104.4.158    4am-node11   <none>           <none>
fouramf-gjwdb-59-4613-milvus-rootcoord-6c96878577-r5wnp           1/1     Running     0               3m54s   10.104.1.81     4am-node10   <none>           <none>
fouramf-gjwdb-59-4613-minio-0                                     1/1     Running     0               3m54s   10.104.6.243    4am-node13   <none>           <none>
fouramf-gjwdb-59-4613-minio-1                                     1/1     Running     0               3m53s   10.104.22.38    4am-node26   <none>           <none>
fouramf-gjwdb-59-4613-minio-2                                     1/1     Running     0               3m53s   10.104.15.247   4am-node20   <none>           <none>
fouramf-gjwdb-59-4613-minio-3                                     1/1     Running     0               3m53s   10.104.23.125   4am-node27   <none>           <none>
fouramf-gjwdb-59-4613-zookeeper-0                                 1/1     Running     0               3m54s   10.104.15.245   4am-node20   <none>           <none>
fouramf-gjwdb-59-4613-zookeeper-1                                 1/1     Running     0               3m53s   10.104.22.41    4am-node26   <none>           <none>
fouramf-gjwdb-59-4613-zookeeper-2                                 1/1     Running     0               3m53s   10.104.23.124   4am-node27   <none>           <none> 

memory: image

elstic avatar May 06 '23 11:05 elstic

The balancer failed to keep the row count balanced because load is not idempotent. @yah01's #23968 is fixing this.
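
To illustrate why row-count balancing depends on idempotent loads, here is a toy model. It is not Milvus code and does not describe the actual change in #23968; the `Node` class, the greedy placement rule, and the duplicated request are all invented for the example. The planner sends each segment to the node with the fewest counted rows, and a node that is not idempotent counts a repeated load request again.

```python
class Node:
    """Toy query node that tracks the rows it believes it holds."""

    def __init__(self, idempotent):
        self.idempotent = idempotent
        self.segments = {}   # segment id -> rows actually held
        self.counted = 0     # rows according to the bookkeeping

    def load(self, segment_id, rows):
        if self.idempotent and segment_id in self.segments:
            return               # a repeated request is a no-op
        self.segments[segment_id] = rows
        self.counted += rows     # without the guard, a repeat is counted again


def simulate(idempotent, duplicates=2, num_segments=4, rows=1000):
    """Place segments greedily by counted rows; segment 0 gets duplicate requests."""
    nodes = [Node(idempotent), Node(idempotent)]
    for segment_id in range(num_segments):
        target = min(nodes, key=lambda n: n.counted)
        target.load(segment_id, rows)
        if segment_id == 0:
            for _ in range(duplicates):
                target.load(segment_id, rows)   # simulated repeated request
    counted = [n.counted for n in nodes]
    actual = [sum(n.segments.values()) for n in nodes]
    return counted, actual


print("non-idempotent:", simulate(idempotent=False))
print("idempotent:    ", simulate(idempotent=True))
```

In the non-idempotent run the bookkeeping looks balanced while one node actually holds three times as many rows as the other. In the idempotent run the counted and actual rows agree.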

congqixia avatar May 09 '23 06:05 congqixia

/assign @elstic could you please verify whether the problem still persists after the patch is merged?

congqixia avatar May 15 '23 09:05 congqixia

Verified that the issue has been fixed. Verification image: master-20230515-8a85dd68

elstic avatar May 17 '23 02:05 elstic