[Bug]: [benchmark] querynode memory allocation imbalance, resulting in load failure
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20230424-e23af640
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
HNSW index, insert 100M data; the load fails and reports: "deny to load, insufficient memory, please allocate more resources".
argo task: fouramf-lwq89, case: test_concurrent_locust_100m_hnsw_ddl_dql_filter_kafka_cluster. The memory allocation for this case was set in advance: each of the 2 querynodes has a 64 GB memory limit, which should be enough for 100M data.
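For context, a rough back-of-the-envelope estimate of the expected footprint; the vector dimension and HNSW overhead factor below are assumptions for illustration, not values taken from the case config:

```python
# Rough estimate only; dim and the HNSW overhead factor are assumptions.
rows = 100_000_000                       # 100M vectors
dim = 128                                # assumed benchmark dimension
raw_gib = rows * dim * 4 / 2**30         # float32 vectors
hnsw_gib = raw_gib * 1.2                 # assumed ~1.2x expansion for the HNSW graph
print(f"raw: {raw_gib:.1f} GiB, with HNSW: {hnsw_gib:.1f} GiB, "
      f"per node if balanced: {hnsw_gib / 2:.1f} GiB")
# ~47.7 GiB raw, ~57.2 GiB with graph overhead, ~28.6 GiB per node,
# comfortably under the 64 GiB limit when the two querynodes are balanced.
```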
server:
fouramf-lwq89-90-4387-etcd-0 1/1 Running 0 4m23s 10.104.20.15 4am-node22 <none> <none>
fouramf-lwq89-90-4387-etcd-1 1/1 Running 0 4m23s 10.104.18.201 4am-node25 <none> <none>
fouramf-lwq89-90-4387-etcd-2 1/1 Running 0 4m23s 10.104.22.128 4am-node26 <none> <none>
fouramf-lwq89-90-4387-kafka-0 1/1 Running 0 4m23s 10.104.20.17 4am-node22 <none> <none>
fouramf-lwq89-90-4387-kafka-1 1/1 Running 3 (2m59s ago) 4m23s 10.104.5.167 4am-node12 <none> <none>
fouramf-lwq89-90-4387-kafka-2 1/1 Running 3 (3m12s ago) 4m23s 10.104.9.140 4am-node14 <none> <none>
fouramf-lwq89-90-4387-milvus-datacoord-56f6698fb8-r2cbf 1/1 Running 0 4m23s 10.104.16.22 4am-node21 <none> <none>
fouramf-lwq89-90-4387-milvus-datanode-888977d79-rzmxh 1/1 Running 0 4m23s 10.104.15.226 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-indexcoord-778b7df58b-x4nml 1/1 Running 0 4m23s 10.104.15.225 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-indexnode-6b7c66d545-nfx7s 1/1 Running 0 4m23s 10.104.15.224 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-proxy-5d75f8c7cb-hr29w 1/1 Running 0 4m23s 10.104.20.10 4am-node22 <none> <none>
fouramf-lwq89-90-4387-milvus-querycoord-7bf7f7999c-878xq 1/1 Running 0 4m23s 10.104.15.227 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-glhnj 1/1 Running 0 4m23s 10.104.20.11 4am-node22 <none> <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-kvft6 1/1 Running 0 4m23s 10.104.16.23 4am-node21 <none> <none>
fouramf-lwq89-90-4387-milvus-rootcoord-6db8f647ff-sm8rw 1/1 Running 0 4m23s 10.104.15.223 4am-node20 <none> <none>
fouramf-lwq89-90-4387-minio-0 1/1 Running 0 4m23s 10.104.16.25 4am-node21 <none> <none>
fouramf-lwq89-90-4387-minio-1 1/1 Running 0 4m23s 10.104.20.19 4am-node22 <none> <none>
fouramf-lwq89-90-4387-minio-2 1/1 Running 0 4m23s 10.104.18.203 4am-node25 <none> <none>
fouramf-lwq89-90-4387-minio-3 1/1 Running 0 4m23s 10.104.19.139 4am-node28 <none> <none>
fouramf-lwq89-90-4387-zookeeper-0 1/1 Running 0 4m23s 10.104.20.16 4am-node22 <none> <none>
fouramf-lwq89-90-4387-zookeeper-1 1/1 Running 0 4m23s 10.104.18.200 4am-node25 <none> <none>
fouramf-lwq89-90-4387-zookeeper-2 1/1 Running 0 4m23s 10.104.22.122 4am-node26 <none> <none>
server (after querynode restart):
fouramf-lwq89-90-4387-etcd-0 1/1 Running 0 161m 10.104.20.15 4am-node22 <none> <none>
fouramf-lwq89-90-4387-etcd-1 1/1 Running 0 161m 10.104.18.201 4am-node25 <none> <none>
fouramf-lwq89-90-4387-etcd-2 1/1 Running 0 161m 10.104.22.128 4am-node26 <none> <none>
fouramf-lwq89-90-4387-kafka-0 1/1 Running 0 161m 10.104.20.17 4am-node22 <none> <none>
fouramf-lwq89-90-4387-kafka-1 1/1 Running 3 (159m ago) 161m 10.104.5.167 4am-node12 <none> <none>
fouramf-lwq89-90-4387-kafka-2 1/1 Running 3 (159m ago) 161m 10.104.9.140 4am-node14 <none> <none>
fouramf-lwq89-90-4387-milvus-datacoord-56f6698fb8-r2cbf 1/1 Running 0 161m 10.104.16.22 4am-node21 <none> <none>
fouramf-lwq89-90-4387-milvus-datanode-888977d79-rzmxh 1/1 Running 0 161m 10.104.15.226 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-indexcoord-778b7df58b-x4nml 1/1 Running 0 161m 10.104.15.225 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-indexnode-6b7c66d545-nfx7s 1/1 Running 0 161m 10.104.15.224 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-proxy-5d75f8c7cb-hr29w 1/1 Running 0 161m 10.104.20.10 4am-node22 <none> <none>
fouramf-lwq89-90-4387-milvus-querycoord-7bf7f7999c-878xq 1/1 Running 0 161m 10.104.15.227 4am-node20 <none> <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-glhnj 1/1 Running 1 (11m ago) 161m 10.104.20.11 4am-node22 <none> <none>
fouramf-lwq89-90-4387-milvus-querynode-657b589777-kvft6 1/1 Running 1 (10m ago) 161m 10.104.16.23 4am-node21 <none> <none>
fouramf-lwq89-90-4387-milvus-rootcoord-6db8f647ff-sm8rw 1/1 Running 0 161m 10.104.15.223 4am-node20 <none> <none>
fouramf-lwq89-90-4387-minio-0 1/1 Running 0 161m 10.104.16.25 4am-node21 <none> <none>
fouramf-lwq89-90-4387-minio-1 1/1 Running 0 161m 10.104.20.19 4am-node22 <none> <none>
fouramf-lwq89-90-4387-minio-2 1/1 Running 0 161m 10.104.18.203 4am-node25 <none> <none>
fouramf-lwq89-90-4387-minio-3 1/1 Running 0 161m 10.104.19.139 4am-node28 <none> <none>
fouramf-lwq89-90-4387-zookeeper-0 1/1 Running 0 161m 10.104.20.16 4am-node22 <none> <none>
fouramf-lwq89-90-4387-zookeeper-1 1/1 Running 0 161m 10.104.18.200 4am-node25 <none> <none>
fouramf-lwq89-90-4387-zookeeper-2 1/1 Running 0 161m 10.104.22.122 4am-node26 <none> <none>
client:
[2023-04-24 14:15:08,894 - INFO - fouram]: [CommonCases] RT of build index HNSW: 4333.0958s (common_cases.py:87)
[2023-04-24 14:15:08,898 - INFO - fouram]: [Base] Params of index: {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 8, 'efConstruction': 200}} (base.py:297)
[2023-04-24 14:15:08,898 - INFO - fouram]: [CommonCases] Prepare index HNSW done. (common_cases.py:90)
[2023-04-24 14:15:08,898 - INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:95)
[2023-04-24 14:15:08,898 - INFO - fouram]: [Base] Start load collection fouram_2qUsR2gY,replica_number:1,kwargs:{} (base.py:138)
[2023-04-24 14:27:26,018 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:27:26.016189', 'RPC error': '2023-04-24 14:27:26.018055'}> (decorators.py:108)
[2023-04-24 14:27:26,020 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:15:08.927685', 'RPC error': '2023-04-24 14:27:26.020265'}> (decorators.py:108)
[2023-04-24 14:27:26,020 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_2qUsR2gY)>, <Time:{'RPC start': '2023-04-24 14:15:08.898503', 'RPC error': '2023-04-24 14:27:26.020411'}> (decorators.py:108)
[2023-04-24 14:27:26,021 - ERROR - fouram]: Traceback (most recent call last):
File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
res = func(*args, **kwargs)
File "/src/fouram/client/util/api_request.py", line 70, in api_request
return func(*arg, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load
conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
memory usage:
Expected Behavior
Memory usage is balanced across the 2 querynodes and the load succeeds.
Steps To Reproduce
1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns or not
7. count the total number of rows
8. load collection ===> failed (see the sketch after this list)
9. perform concurrent operations
10. clean all collections or not
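A minimal pymilvus sketch of this flow (scaled down; the collection name, dimension, and batch sizes are placeholders, while the index parameters match the ones logged above):

```python
# Hedged sketch of the reproduce flow; collection name, dim and batch sizes are placeholders.
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),  # assumed dim
])
collection = Collection("fouram_repro", schema)                   # step 1

index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index_params)             # step 2

for batch in range(10):                                           # step 3 (scaled down from 100M)
    start = batch * 10_000
    ids = list(range(start, start + 10_000))
    vectors = np.random.random((10_000, 128)).tolist()
    collection.insert([ids, vectors])

collection.flush()                                                 # step 4
collection.create_index("float_vector", index_params)              # step 5: same params
print(collection.num_entities)                                     # step 7: row count
collection.load(replica_number=1)                                   # step 8: fails at 100M scale
```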
Milvus Log
No response
Anything else?
No response
/assign @congqixia /unassign
@congqixia I retried and the issue still persists. Image used: master-20230505-38f45c37, argo task: fouramf-gjwdb
server:
fouramf-gjwdb-59-4613-etcd-0 1/1 Running 0 3m54s 10.104.6.245 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-etcd-1 1/1 Running 0 3m53s 10.104.22.40 4am-node26 <none> <none>
fouramf-gjwdb-59-4613-etcd-2 1/1 Running 0 3m53s 10.104.23.122 4am-node27 <none> <none>
fouramf-gjwdb-59-4613-kafka-0 1/1 Running 2 (97s ago) 3m54s 10.104.6.244 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-kafka-1 1/1 Running 1 (110s ago) 3m53s 10.104.4.171 4am-node11 <none> <none>
fouramf-gjwdb-59-4613-kafka-2 1/1 Running 1 (106s ago) 3m53s 10.104.16.126 4am-node21 <none> <none>
fouramf-gjwdb-59-4613-milvus-datacoord-6ddf748f54-x5jjh 1/1 Running 0 3m54s 10.104.6.223 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-milvus-datanode-744c658cdf-mtxnr 1/1 Running 0 3m53s 10.104.6.222 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-milvus-indexcoord-656bdb998d-qmtls 1/1 Running 0 3m54s 10.104.22.35 4am-node26 <none> <none>
fouramf-gjwdb-59-4613-milvus-indexnode-85877fc7f-hprgb 1/1 Running 0 3m54s 10.104.1.82 4am-node10 <none> <none>
fouramf-gjwdb-59-4613-milvus-proxy-94567d57d-tlknd 1/1 Running 0 3m53s 10.104.6.221 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-milvus-querycoord-55dfd559bc-q7nth 1/1 Running 0 3m53s 10.104.22.34 4am-node26 <none> <none>
fouramf-gjwdb-59-4613-milvus-querynode-cddc59-fnndr 1/1 Running 0 3m54s 10.104.9.155 4am-node14 <none> <none>
fouramf-gjwdb-59-4613-milvus-querynode-cddc59-wllqq 1/1 Running 0 3m54s 10.104.4.158 4am-node11 <none> <none>
fouramf-gjwdb-59-4613-milvus-rootcoord-6c96878577-r5wnp 1/1 Running 0 3m54s 10.104.1.81 4am-node10 <none> <none>
fouramf-gjwdb-59-4613-minio-0 1/1 Running 0 3m54s 10.104.6.243 4am-node13 <none> <none>
fouramf-gjwdb-59-4613-minio-1 1/1 Running 0 3m53s 10.104.22.38 4am-node26 <none> <none>
fouramf-gjwdb-59-4613-minio-2 1/1 Running 0 3m53s 10.104.15.247 4am-node20 <none> <none>
fouramf-gjwdb-59-4613-minio-3 1/1 Running 0 3m53s 10.104.23.125 4am-node27 <none> <none>
fouramf-gjwdb-59-4613-zookeeper-0 1/1 Running 0 3m54s 10.104.15.245 4am-node20 <none> <none>
fouramf-gjwdb-59-4613-zookeeper-1 1/1 Running 0 3m53s 10.104.22.41 4am-node26 <none> <none>
fouramf-gjwdb-59-4613-zookeeper-2 1/1 Running 0 3m53s 10.104.23.124 4am-node27 <none> <none>
memory:
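One way to inspect how loaded rows are distributed across the query nodes is pymilvus's `utility.get_query_segment_info`; this is a sketch, and the node-id attribute name differs between pymilvus versions:

```python
# Sketch: sum loaded rows per query node; attribute names may vary by pymilvus version.
from collections import defaultdict
from pymilvus import connections, utility

connections.connect(host="127.0.0.1", port="19530")

rows_per_node = defaultdict(int)
for seg in utility.get_query_segment_info("fouram_2qUsR2gY"):
    # older releases expose a single nodeID, newer ones a repeated nodeIds field
    node_ids = getattr(seg, "nodeIds", None) or [getattr(seg, "nodeID", "unknown")]
    for node_id in node_ids:
        rows_per_node[node_id] += seg.num_rows

for node_id, rows in rows_per_node.items():
    print(f"querynode {node_id}: {rows} rows")
```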
The balancer failed to keep row counts balanced because load is not idempotent; @yah01's #23968 is fixing this.
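As a toy illustration (not Milvus code) of why a non-idempotent load step breaks row-count balancing: if a retried load task adds a segment's rows to the node's bookkeeping again instead of being a no-op, the balancer's view of the nodes drifts from reality and it starts placing segments unevenly:

```python
# Toy model, not Milvus code: a retried, non-idempotent load double-counts rows,
# so the "least loaded node" heuristic is fed a wrong picture and the real load skews.
rows_per_node = {"querynode-A": 0, "querynode-B": 0}   # balancer's bookkeeping
actual_rows = {"querynode-A": 0, "querynode-B": 0}     # what is really loaded

def load_segment(node, rows, first_attempt=True):
    rows_per_node[node] += rows          # not idempotent: counted on every attempt
    if first_attempt:
        actual_rows[node] += rows        # the segment itself is only loaded once

for i in range(8):                        # 8 equal segments of 1M rows
    node = min(rows_per_node, key=rows_per_node.get)
    load_segment(node, 1_000_000)
    if i == 3:                            # simulate a timeout + retry of the same task
        load_segment(node, 1_000_000, first_attempt=False)

print("bookkeeping:", rows_per_node)      # looks close to balanced
print("actually loaded:", actual_rows)    # one node ends up with noticeably more rows
```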
/assign @elstic could you please verify whether the problem still persists after the patch is merged?
Verified that the issue has been fixed. Verification image: master-20230515-8a85dd68