
[Bug]: [benchmark][cluster] When Milvus handles inserts and queries at the same time, querynode memory gradually rises until OOM

Open · jingkl opened this issue 2 years ago · 1 comment

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.1.0-20220727-7169256c
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.1.0dev103
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server-instance fouram-6dgds-1 server-configmap server-cluster-8c64m-compaction client-configmap client-random-locust-compaction-5h

fouram-6dgds-1-etcd-0                                            1/1     Running     0               13h     10.104.4.179   4am-node11   <none>           <none>
fouram-6dgds-1-etcd-1                                            1/1     Running     0               13h     10.104.6.170   4am-node13   <none>           <none>
fouram-6dgds-1-etcd-2                                            1/1     Running     0               13h     10.104.1.101   4am-node10   <none>           <none>
fouram-6dgds-1-milvus-datacoord-68c7f8f8fc-t984h                 1/1     Running     1 (12h ago)     13h     10.104.4.172   4am-node11   <none>           <none>
fouram-6dgds-1-milvus-datanode-79fb9687c6-26pb6                  1/1     Running     1 (12h ago)     13h     10.104.4.173   4am-node11   <none>           <none>
fouram-6dgds-1-milvus-indexcoord-5cf68cff5d-jg969                1/1     Running     1 (12h ago)     13h     10.104.6.167   4am-node13   <none>           <none>
fouram-6dgds-1-milvus-indexnode-6dd4566667-2wnhg                 1/1     Running     0               13h     10.104.5.165   4am-node12   <none>           <none>
fouram-6dgds-1-milvus-proxy-6b68c77696-x2426                     1/1     Running     1 (12h ago)     13h     10.104.4.171   4am-node11   <none>           <none>
fouram-6dgds-1-milvus-querycoord-6459dd997c-pq7r8                1/1     Running     1 (12h ago)     13h     10.104.5.168   4am-node12   <none>           <none>
fouram-6dgds-1-milvus-querynode-b4fd78f45-24d2r                  1/1     Running     8 (10h ago)     13h     10.104.5.171   4am-node12   <none>           <none>
fouram-6dgds-1-milvus-rootcoord-6f5687c6fc-ccwwr                 1/1     Running     0               13h     10.104.5.167   4am-node12   <none>           <none>
fouram-6dgds-1-minio-0                                           1/1     Running     0               13h     10.104.4.177   4am-node11   <none>           <none>
fouram-6dgds-1-minio-1                                           1/1     Running     0               13h     10.104.1.98    4am-node10   <none>           <none>
fouram-6dgds-1-minio-2                                           1/1     Running     0               13h     10.104.6.175   4am-node13   <none>           <none>
fouram-6dgds-1-minio-3                                           1/1     Running     0               13h     10.104.5.173   4am-node12   <none>           <none>
fouram-6dgds-1-pulsar-bookie-0                                   1/1     Running     0               13h     10.104.6.174   4am-node13   <none>           <none>
fouram-6dgds-1-pulsar-bookie-1                                   1/1     Running     0               13h     10.104.1.102   4am-node10   <none>           <none>
fouram-6dgds-1-pulsar-bookie-2                                   1/1     Running     0               13h     10.104.4.182   4am-node11   <none>           <none>
fouram-6dgds-1-pulsar-bookie-init-wxwtl                          0/1     Completed   0               13h     10.104.5.169   4am-node12   <none>           <none>
fouram-6dgds-1-pulsar-broker-0                                   1/1     Running     0               13h     10.104.6.168   4am-node13   <none>           <none>
fouram-6dgds-1-pulsar-proxy-0                                    1/1     Running     0               13h     10.104.5.166   4am-node12   <none>           <none>
fouram-6dgds-1-pulsar-pulsar-init-vj9ql                          0/1     Completed   0               13h     10.104.5.170   4am-node12   <none>           <none>
fouram-6dgds-1-pulsar-recovery-0                                 1/1     Running     0               13h     10.104.1.95    4am-node10   <none>           <none>
fouram-6dgds-1-pulsar-zookeeper-0                                1/1     Running     0               13h     10.104.4.178   4am-node11   <none>           <none>
fouram-6dgds-1-pulsar-zookeeper-1                                1/1     Running     0               13h     10.104.6.177   4am-node13   <none>           <none>
fouram-6dgds-1-pulsar-zookeeper-2                                1/1     Running     0               13h     10.104.1.104   4am-node10   <none>           <none>

querynode mem: (screenshot: 2022-07-29 10:01:03)

tart': '2022-07-28 13:15:01.422718', 'RPC error': '2022-07-28 13:15:01.456803'}> (pymilvus.decorators:95)
[2022-07-28 13:15:01,462] [   DEBUG] - Milvus get_info run in 0.0208s (milvus_benchmark.client:56)
[2022-07-28 13:15:01,470] [   DEBUG] - [scene_insert_delete_flush] Start insert : sift_10w_128_l2 (milvus_benchmark.client:651)
[2022-07-28 13:15:01,472] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.422871', 'RPC error': '2022-07-28 13:15:01.472392'}> (pymilvus.decorators:95)
[2022-07-28 13:15:01,472] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.423086', 'RPC error': '2022-07-28 13:15:01.472789'}> (pymilvus.decorators:95)
[2022-07-28 13:15:01,472] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.422981', 'RPC error': '2022-07-28 13:15:01.472983'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,773] [   ERROR] - RPC error: [query], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Query, QueryNode ID = 3, reason=query shard(channel)  by-dev-rootcoord-dml_1_434900411607354689v1  does not exist )>, <Time:{'RPC start': '2022-07-28 13:15:01.443969', 'RPC error': '2022-07-28 13:15:05.773038'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,773] [   ERROR] - RPC error: [query], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Query, QueryNode ID = 3, reason=query shard(channel)  by-dev-rootcoord-dml_1_434900411607354689v1  does not exist )>, <Time:{'RPC start': '2022-07-28 13:15:01.444121', 'RPC error': '2022-07-28 13:15:05.773791'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,774] [   ERROR] - RPC error: [query], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Query, QueryNode ID = 3, reason=query shard(channel)  by-dev-rootcoord-dml_1_434900411607354689v1  does not exist )>, <Time:{'RPC start': '2022-07-28 13:15:01.445513', 'RPC error': '2022-07-28 13:15:05.774003'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,774] [   ERROR] - RPC error: [query], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Query, QueryNode ID = 3, reason=query shard(channel)  by-dev-rootcoord-dml_1_434900411607354689v1  does not exist )>, <Time:{'RPC start': '2022-07-28 13:15:01.445369', 'RPC error': '2022-07-28 13:15:05.774165'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,774] [   ERROR] - RPC error: [query], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Query, QueryNode ID = 3, reason=query shard(channel)  by-dev-rootcoord-dml_0_434900411607354689v0  does not exist )>, <Time:{'RPC start': '2022-07-28 13:15:01.444540', 'RPC error': '2022-07-28 13:15:05.774319'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,774] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.457276', 'RPC error': '2022-07-28 13:15:05.774819'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,775] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.443467', 'RPC error': '2022-07-28 13:15:05.775050'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,775] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.444339', 'RPC error': '2022-07-28 13:15:05.775224'}> (pymilvus.decorators:95)
[2022-07-28 13:15:05,775] [   ERROR] - RPC error: [search], <MilvusException: (code=1, message=Invalid shard leader)>, <Time:{'RPC start': '2022-07-28 13:15:01.445701', 'RPC error': '2022-07-28 13:15:05.775389'}> (pymilvus.decorators:95)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

client-random-locust-compaction-5h:

{
	"config.yaml": "locust_random_performance:
		  collections:
		    -
		      collection_name: sift_10w_128_l2
		      ni_per: 50000
		      # other_fields: int1,int2,float1,double1
		      other_fields: float1
		      build_index: true
		      index_type: ivf_sq8
		      index_param:
		        nlist: 2048
		      task:
		        types:
		          -
		            type: query
		            weight: 20
		            params:
		              top_k: 10
		              nq: 10
		              search_param:
		                nprobe: 16
		              filters:
		                -
		                  range: \"{'range': {'float1': {'GT': -1.0, 'LT': collection_size * 0.5}}}\"
		          -
		            type: load
		            weight: 1
		          -
		            type: get
		            weight: 10
		            params:
		              ids_length: 10
		          -
		            type: scene_insert_delete_flush
		            weight: 1
		        connection_num: 1
		        clients_num: 20
		        spawn_rate: 2
		        # during_time: 84h
		        during_time: 5h
		"
}
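For readers unfamiliar with this benchmark config format: the `query` task's `range` filter is a dict that the client translates into a Milvus boolean expression before searching. A minimal sketch of that translation (the helper name and the `GT`/`LT` to `>`/`<` mapping are assumptions for illustration; for `sift_10w_128_l2`, `collection_size` is 100k):

```python
# Hypothetical helper: turn the benchmark's range-filter dict into a
# Milvus boolean expression string usable as a search/query `expr`.
def range_filter_to_expr(filter_dict: dict) -> str:
    ops = {"GT": ">", "GE": ">=", "LT": "<", "LE": "<="}  # assumed op mapping
    clauses = []
    for field, conds in filter_dict["range"].items():
        for op, value in conds.items():
            clauses.append(f"{field} {ops[op]} {value}")
    return " && ".join(clauses)

collection_size = 100_000  # sift_10w = 100k vectors
expr = range_filter_to_expr(
    {"range": {"float1": {"GT": -1.0, "LT": collection_size * 0.5}}}
)
print(expr)  # float1 > -1.0 && float1 < 50000.0
```

With `weight: 20` on this task versus `weight: 1` for insert/delete/flush, the workload is search-heavy while data is continuously mutated in the background.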

jingkl avatar Jul 29 '22 02:07 jingkl

From analyzing the logs and memory usage, we can see that the load action reads the raw vector field rather than the index files, so memory usage is much larger than on the 2.1 branch.

The next step is to find out why the index was not created as expected.
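A back-of-the-envelope estimate illustrates why loading raw vectors instead of the index matters here (assumptions: float32 vectors, and an IVF_SQ8 index that stores roughly 1 byte per dimension per vector plus float32 centroids):

```python
# Rough memory estimate for sift_10w_128_l2 under the two load paths.
NUM_VECTORS = 100_000   # "10w" = 100k vectors
DIM = 128
NLIST = 2048            # from the benchmark config's index_param

# Raw float32 vectors (what the logs suggest the load actually read).
raw_bytes = NUM_VECTORS * DIM * 4

# Assumed IVF_SQ8 footprint: 1-byte quantized codes plus float32 centroids.
index_bytes = NUM_VECTORS * DIM * 1 + NLIST * DIM * 4

print(f"raw vectors:   {raw_bytes / 2**20:.1f} MiB")   # ~48.8 MiB
print(f"ivf_sq8 index: {index_bytes / 2**20:.1f} MiB") # ~13.2 MiB
print(f"ratio:         {raw_bytes / index_bytes:.1f}x")
```

Under these assumptions, loading raw vectors costs roughly 3.7x the memory of the index for a single collection, and the gap compounds as the insert/delete/flush task keeps adding segments over the 5h run.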

jiaoew1991 avatar Aug 11 '22 03:08 jiaoew1991

@aoiasd please follow up on this issue

jiaoew1991 avatar Aug 29 '22 03:08 jiaoew1991

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 28 '22 11:09 stale[bot]