[Bug]: [benchmark] The predicted inverted index resource usage is greater than the actual usage
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.4-20240323-5d3aa2a4-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: inverted-corn-1711209600, test case name: test_inverted_locust_varchar_dql_cluster
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inverted-corn-109600-4-78-6884-etcd-0 1/1 Running 0 9h 10.104.25.162 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-etcd-1 1/1 Running 0 9h 10.104.29.253 4am-node35 <none> <none>
inverted-corn-109600-4-78-6884-etcd-2 1/1 Running 0 9h 10.104.26.114 4am-node32 <none> <none>
inverted-corn-109600-4-78-6884-milvus-datacoord-7fc99977bflqrlb 1/1 Running 0 9h 10.104.25.152 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-milvus-datanode-695fc49599-tj5r7 1/1 Running 1 (9h ago) 9h 10.104.5.174 4am-node12 <none> <none>
inverted-corn-109600-4-78-6884-milvus-indexcoord-7849d4786dgrd4 1/1 Running 0 9h 10.104.33.20 4am-node36 <none> <none>
inverted-corn-109600-4-78-6884-milvus-indexnode-6cd4b85dbfkgxj2 1/1 Running 0 9h 10.104.31.165 4am-node34 <none> <none>
inverted-corn-109600-4-78-6884-milvus-proxy-584b45776d-d7zn2 1/1 Running 1 (9h ago) 9h 10.104.27.74 4am-node31 <none> <none>
inverted-corn-109600-4-78-6884-milvus-querycoord-5c5d9496cx76km 1/1 Running 1 (9h ago) 9h 10.104.25.151 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-milvus-querynode-9584877fd-dzk8h 1/1 Running 0 9h 10.104.33.22 4am-node36 <none> <none>
inverted-corn-109600-4-78-6884-milvus-rootcoord-65c86b8b45h6n4g 1/1 Running 1 (9h ago) 9h 10.104.33.21 4am-node36 <none> <none>
inverted-corn-109600-4-78-6884-minio-0 1/1 Running 0 9h 10.104.29.250 4am-node35 <none> <none>
inverted-corn-109600-4-78-6884-minio-1 1/1 Running 0 9h 10.104.25.167 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-minio-2 1/1 Running 0 9h 10.104.26.113 4am-node32 <none> <none>
inverted-corn-109600-4-78-6884-minio-3 1/1 Running 0 9h 10.104.27.109 4am-node31 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-bookie-0 1/1 Running 0 9h 10.104.26.104 4am-node32 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-bookie-1 1/1 Running 0 9h 10.104.29.251 4am-node35 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-bookie-2 1/1 Running 0 9h 10.104.25.168 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-bookie-init-bk9rh 0/1 Completed 0 9h 10.104.33.18 4am-node36 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-broker-0 1/1 Running 0 9h 10.104.27.72 4am-node31 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-proxy-0 1/1 Running 0 9h 10.104.25.153 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-pulsar-init-tk7lf 0/1 Completed 0 9h 10.104.33.19 4am-node36 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-recovery-0 1/1 Running 0 9h 10.104.27.75 4am-node31 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-zookeeper-0 1/1 Running 0 9h 10.104.26.103 4am-node32 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-zookeeper-1 1/1 Running 0 9h 10.104.25.174 4am-node30 <none> <none>
inverted-corn-109600-4-78-6884-pulsar-zookeeper-2 1/1 Running 0 9h 10.104.27.115 4am-node31 <none> <none>
The actual memory used by queryNode is about 1 GB, but the predicted memory usage is about 10 GB, as shown in the loader logs below:
[2024/03/24 00:55:22.193 +00:00] [INFO] [segments/segment_loader.go:1412] ["predict memory and disk usage while loading (in MiB)"] [traceID=fcfd02781e54186abda5bdf3e6ca6dbb] [collectionID=448583440405365385] [maxSegmentSize(MB)=10383.269264221191] [committedMemSize(MB)=10606.47898387909] [memLimit(MB)=65536] [memUsage(MB)=10697.96726512909] [committedDiskSize(MB)=0] [diskUsage(MB)=0] [predictMemUsage(MB)=21081.23652935028] [predictDiskUsage(MB)=0] [mmapFieldCount=0]
[2024/03/24 00:55:22.192 +00:00] [INFO] [segments/segment_loader.go:1412] ["predict memory and disk usage while loading (in MiB)"] [traceID=fcfd02781e54186abda5bdf3e6ca6dbb] [collectionID=448583440405365385] [maxSegmentSize(MB)=10606.47898387909] [committedMemSize(MB)=0] [memLimit(MB)=65536] [memUsage(MB)=91.48828125] [committedDiskSize(MB)=0] [diskUsage(MB)=0] [predictMemUsage(MB)=10697.96726512909] [predictDiskUsage(MB)=0] [mmapFieldCount=0]
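The two loader log lines above are consistent with predictMemUsage being the current memUsage plus the next segment's estimated size, so each pending segment adds its full ~10 GB estimate to the prediction. A minimal sketch of that arithmetic (illustration only, reusing the log field names; this is not the actual Go code in segments/segment_loader.go):

# Reproduce the arithmetic visible in the two log lines above.
def predict_mem_usage_mb(mem_usage_mb: float, segment_estimate_mb: float) -> float:
    """predictMemUsage appears to be the current memUsage plus the new segment's estimate."""
    return mem_usage_mb + segment_estimate_mb

# 00:55:22.192 line: nothing committed yet, memUsage is ~91 MB.
print(predict_mem_usage_mb(91.49, 10606.48))     # ~10697.97, matches predictMemUsage(MB)

# 00:55:22.193 line: the first estimate is now committed and counted in memUsage,
# and the second segment's ~10 GB estimate is added on top.
print(predict_mem_usage_mb(10697.97, 10383.27))  # ~21081.24, matches predictMemUsage(MB)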
client pod name: inverted-corn-1711209600-971096786
Expected Behavior
No response
Steps To Reproduce
concurrent test and calculation of RT and QPS
:purpose: `varchar: different max_length`
verify a concurrent DQL scenario with 3 VARCHAR scalar fields and INVERTED indexes (a minimal pymilvus sketch of steps 1-2 follows the list)
:test steps:
1. create collection with fields:
'float_vector': 3dim,
'varchar_1': max_length=256, varchar_filled=True
'varchar_2': max_length=32768, varchar_filled=True
'varchar_3': max_length=65535, varchar_filled=True
2. build indexes:
IVF_FLAT: 'float_vector'
INVERTED: 'varchar_1', 'varchar_2', 'varchar_3'
3. insert 300k data
4. flush collection
5. build indexes again using the same params
6. load collection
7. concurrent request:
- search
- query
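For reference, a minimal pymilvus 2.4 sketch of steps 1-2 (the collection name and connection details are placeholders, not the benchmark framework's actual code):

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=3),
    FieldSchema("varchar_1", DataType.VARCHAR, max_length=256),
    FieldSchema("varchar_2", DataType.VARCHAR, max_length=32768),
    FieldSchema("varchar_3", DataType.VARCHAR, max_length=65535),
]
collection = Collection("inverted_bench", CollectionSchema(fields), shards_num=2)

# IVF_FLAT on the vector field, INVERTED on the three varchar fields (step 2).
collection.create_index("float_vector",
                        {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}})
for name in ("varchar_1", "varchar_2", "varchar_3"):
    collection.create_index(name, {"index_type": "INVERTED"})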
Milvus Log
No response
Anything else?
test result:
{'server': {'deploy_tool': 'helm',
'deploy_mode': 'cluster',
'config_name': 'cluster_2c4m',
'config': {'queryNode': {'resources': {'limits': {'cpu': '8',
'memory': '64Gi'},
'requests': {'cpu': '8',
'memory': '32Gi'}},
'replicas': 1},
'indexNode': {'resources': {'limits': {'cpu': '4.0',
'memory': '16Gi'},
'requests': {'cpu': '3.0',
'memory': '9Gi'}},
'replicas': 1},
'dataNode': {'resources': {'limits': {'cpu': '2.0',
'memory': '4Gi'},
'requests': {'cpu': '2.0',
'memory': '3Gi'}}},
'cluster': {'enabled': True},
'pulsar': {},
'kafka': {},
'minio': {'metrics': {'podMonitor': {'enabled': True}}},
'etcd': {'metrics': {'enabled': True,
'podMonitor': {'enabled': True}}},
'metrics': {'serviceMonitor': {'enabled': True}},
'log': {'level': 'debug'},
'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus',
'tag': '2.4-20240323-5d3aa2a4-amd64'}}},
'host': 'inverted-corn-109600-4-78-6884-milvus.qa-milvus.svc.cluster.local',
'port': '19530',
'uri': ''},
'client': {'test_case_type': 'ConcurrentClientBase',
'test_case_name': 'test_inverted_locust_varchar_dql_cluster',
'test_case_params': {'dataset_params': {'metric_type': 'L2',
'dim': 3,
'scalars_index': {'varchar_1': {'index_type': 'INVERTED'},
'varchar_2': {'index_type': 'INVERTED'},
'varchar_3': {'index_type': 'INVERTED'}},
'scalars_params': {'varchar_1': {'params': {'max_length': 256},
'other_params': {'varchar_filled': True}},
'varchar_2': {'params': {'max_length': 32768},
'other_params': {'varchar_filled': True}},
'varchar_3': {'params': {'max_length': 65535},
'other_params': {'varchar_filled': True}}},
'dataset_name': 'local',
'dataset_size': 300000,
'ni_per': 50},
'collection_params': {'other_fields': ['varchar_1',
'varchar_2',
'varchar_3'],
'shards_num': 2},
'resource_groups_params': {'reset': False},
'database_user_params': {'reset_rbac': False,
'reset_db': False},
'index_params': {'index_type': 'IVF_FLAT',
'index_param': {'nlist': 1024}},
'concurrent_params': {'concurrent_number': 50,
'during_time': '1h',
'interval': 20,
'spawn_rate': None},
'concurrent_tasks': [{'type': 'search',
'weight': 1,
'params': {'nq': 1000,
'top_k': 10,
'search_param': {'nprobe': 32},
'expr': 'varchar_1 like "a%" && varchar_2 like "A%" && varchar_3 like "0%" && id > 0',
'guarantee_timestamp': None,
'partition_names': None,
'output_fields': None,
'ignore_growing': False,
'group_by_field': None,
'timeout': 60,
'random_data': True}},
{'type': 'query',
'weight': 1,
'params': {'ids': None,
'expr': 'id > -1 &&',
'output_fields': ['float_vector'],
'offset': None,
'limit': None,
'ignore_growing': False,
'partition_names': None,
'timeout': 60,
'random_data': True,
'random_count': 10,
'random_range': [0,
150000.0],
'field_name': 'id',
'field_type': 'int64'}}]},
'run_id': 2024032397268355,
'datetime': '2024-03-23 16:02:06.668282',
'client_version': '2.4.0'},
'result': {'test_result': {'index': {'RT': 3679.7159,
'varchar_1': {'RT': 4013.0191},
'varchar_2': {'RT': 2395.9315},
'varchar_3': {'RT': 6678.7619}},
'insert': {'total_time': 603.547,
'VPS': 497.0615,
'batch_time': 0.1006,
'batch': 50},
'flush': {'RT': 3.024},
'load': {'RT': 55.4049},
'Locust': {'Aggregated': {'Requests': 12142,
'Fails': 0,
'RPS': 3.37,
'fail_s': 0.0,
'RT_max': 40600.92,
'RT_avg': 14777.83,
'TP50': 7300.0,
'TP99': 35000.0},
'query': {'Requests': 6021,
'Fails': 0,
'RPS': 1.67,
'fail_s': 0.0,
'RT_max': 40600.92,
'RT_avg': 23803.72,
'TP50': 23000.0,
'TP99': 37000.0},
'search': {'Requests': 6121,
'Fails': 0,
'RPS': 1.7,
'fail_s': 0.0,
'RT_max': 9874.65,
'RT_avg': 5899.4,
'TP50': 5900.0,
'TP99': 7300.0}}}}}
The issue recurred:
argo task: inverted-corn-1711728000, test case name: test_inverted_locust_varchar_dml_dql_cluster, image: 2.4-20240329-32eff9c6e-amd64
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inverted-corn-128000-5-12-8571-etcd-0 1/1 Running 0 6m32s 10.104.25.71 4am-node30 <none> <none>
inverted-corn-128000-5-12-8571-etcd-1 1/1 Running 0 6m32s 10.104.28.63 4am-node33 <none> <none>
inverted-corn-128000-5-12-8571-etcd-2 1/1 Running 0 6m32s 10.104.33.206 4am-node36 <none> <none>
inverted-corn-128000-5-12-8571-milvus-datacoord-76cddfc5c5hk9n8 1/1 Running 0 6m32s 10.104.14.154 4am-node18 <none> <none>
inverted-corn-128000-5-12-8571-milvus-datanode-5f88d9f47f-dz8d4 1/1 Running 1 (2m1s ago) 6m32s 10.104.14.152 4am-node18 <none> <none>
inverted-corn-128000-5-12-8571-milvus-indexcoord-7cc64bb7btgvhc 1/1 Running 0 6m32s 10.104.27.188 4am-node31 <none> <none>
inverted-corn-128000-5-12-8571-milvus-indexnode-5b6b5ddfb92xld7 1/1 Running 0 6m32s 10.104.15.231 4am-node20 <none> <none>
inverted-corn-128000-5-12-8571-milvus-proxy-fb4f4578b-98l8q 1/1 Running 1 (2m1s ago) 6m32s 10.104.14.153 4am-node18 <none> <none>
inverted-corn-128000-5-12-8571-milvus-querycoord-68896b896spjtn 1/1 Running 1 (2m1s ago) 6m32s 10.104.14.151 4am-node18 <none> <none>
inverted-corn-128000-5-12-8571-milvus-querynode-b8f47f9ff-nxs5g 1/1 Running 0 6m32s 10.104.5.111 4am-node12 <none> <none>
inverted-corn-128000-5-12-8571-milvus-rootcoord-86d5565b7-hbmwh 1/1 Running 1 (2m ago) 6m32s 10.104.25.65 4am-node30 <none> <none>
inverted-corn-128000-5-12-8571-minio-0 1/1 Running 0 6m32s 10.104.25.72 4am-node30 <none> <none>
inverted-corn-128000-5-12-8571-minio-1 1/1 Running 0 6m31s 10.104.28.64 4am-node33 <none> <none>
inverted-corn-128000-5-12-8571-minio-2 1/1 Running 0 6m31s 10.104.27.194 4am-node31 <none> <none>
inverted-corn-128000-5-12-8571-minio-3 1/1 Running 0 6m31s 10.104.17.218 4am-node23 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-bookie-0 1/1 Running 0 6m32s 10.104.25.73 4am-node30 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-bookie-1 1/1 Running 0 6m31s 10.104.28.65 4am-node33 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-bookie-2 1/1 Running 0 6m31s 10.104.17.219 4am-node23 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-bookie-init-wt7wf 0/1 Completed 0 6m32s 10.104.17.212 4am-node23 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-broker-0 1/1 Running 0 6m32s 10.104.27.189 4am-node31 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-proxy-0 1/1 Running 0 6m32s 10.104.33.204 4am-node36 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-pulsar-init-42c74 0/1 Completed 0 6m32s 10.104.25.64 4am-node30 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-recovery-0 1/1 Running 0 6m32s 10.104.17.211 4am-node23 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-zookeeper-0 1/1 Running 0 6m32s 10.104.27.192 4am-node31 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-zookeeper-1 1/1 Running 0 5m50s 10.104.18.105 4am-node25 <none> <none>
inverted-corn-128000-5-12-8571-pulsar-zookeeper-2 1/1 Running 0 4m46s 10.104.23.61 4am-node27 <none> <none>
client logs:
test steps:
concurrent test and calculation of RT and QPS
:purpose: `varchar: different max_length`
verify a concurrent DML & DQL scenario with 3 VARCHAR scalar fields and INVERTED indexes
:test steps:
1. create collection with fields:
'float_vector': 3dim,
'varchar_1': max_length=256, varchar_filled=True
'varchar_2': max_length=32768, varchar_filled=True
'varchar_3': max_length=65535, varchar_filled=True
2. build indexes:
IVF_FLAT: 'float_vector'
INVERTED: 'varchar_1', 'varchar_2', 'varchar_3'
3. insert 300k data
4. flush collection
5. build indexes again using the same params
6. load collection <- raises an error
7. concurrent request:
- insert
- delete
- flush
- load
- search
- hybrid_search
- query
server config:
{
"queryNode": {
"resources": {
"limits": {
"cpu": "8",
"memory": "32Gi"
},
"requests": {
"cpu": "8",
"memory": "32Gi"
}
},
"replicas": 1
},
"indexNode": {
"resources": {
"limits": {
"cpu": "4.0",
"memory": "16Gi"
},
"requests": {
"cpu": "3.0",
"memory": "9Gi"
}
},
"replicas": 1
},
"dataNode": {
"resources": {
"limits": {
"cpu": "2.0",
"memory": "4Gi"
},
"requests": {
"cpu": "2.0",
"memory": "3Gi"
}
}
},
"cluster": {
"enabled": true
},
"pulsar": {},
"kafka": {},
"minio": {
"metrics": {
"podMonitor": {
"enabled": true
}
}
},
"etcd": {
"metrics": {
"enabled": true,
"podMonitor": {
"enabled": true
}
}
},
"metrics": {
"serviceMonitor": {
"enabled": true
}
},
"log": {
"level": "debug"
},
"image": {
"all": {
"repository": "harbor.milvus.io/milvus/milvus",
"tag": "2.4-20240329-32eff9c6e-amd64"
}
}
}
client config:
{
"dataset_params": {
"metric_type": "L2",
"dim": 3,
"scalars_index": {
"varchar_1": {
"index_type": "INVERTED"
},
"varchar_2": {
"index_type": "INVERTED"
},
"varchar_3": {
"index_type": "INVERTED"
}
},
"scalars_params": {
"varchar_1": {
"params": {
"max_length": 256
},
"other_params": {
"varchar_filled": true
}
},
"varchar_2": {
"params": {
"max_length": 32768
},
"other_params": {
"varchar_filled": true
}
},
"varchar_3": {
"params": {
"max_length": 65535
},
"other_params": {
"varchar_filled": true
}
}
},
"dataset_name": "local",
"dataset_size": 300000,
"ni_per": 50
},
"collection_params": {
"other_fields": [
"varchar_1",
"varchar_2",
"varchar_3"
],
"shards_num": 2
},
"resource_groups_params": {
"reset": false
},
"database_user_params": {
"reset_rbac": false,
"reset_db": false
},
"index_params": {
"index_type": "IVF_FLAT",
"index_param": {
"nlist": 1024
}
},
"concurrent_params": {
"concurrent_number": [
50
],
"during_time": "1h",
"interval": 20
},
"concurrent_tasks": [
{
"type": "insert",
"weight": 1,
"params": {
"nb": 10,
"timeout": 30,
"random_id": true,
"random_vector": true,
"varchar_filled": false,
"start_id": 300000
}
},
{
"type": "delete",
"weight": 1,
"params": {
"expr": "",
"delete_length": 10,
"timeout": 30
}
},
{
"type": "flush",
"weight": 1,
"params": {
"timeout": 600
}
},
{
"type": "load",
"weight": 1,
"params": {
"replica_number": 1,
"timeout": 30
}
},
{
"type": "search",
"weight": 1,
"params": {
"nq": 1000,
"top_k": 1,
"search_param": {
"nprobe": 32
},
"expr": "varchar_1 like \"a%\" && varchar_2 like \"A%\" && varchar_3 like \"0%\" && id > 0",
"guarantee_timestamp": null,
"partition_names": null,
"output_fields": null,
"ignore_growing": false,
"group_by_field": null,
"timeout": 60,
"random_data": true
}
},
{
"type": "hybrid_search",
"weight": 1,
"params": {
"nq": 1,
"top_k": 10,
"reqs": [
{
"search_param": {
"nprobe": 16
},
"anns_field": "float_vector",
"expr": "varchar_1 like \"0%\"",
"top_k": 2000
},
{
"search_param": {
"nprobe": 128
},
"anns_field": "float_vector",
"expr": "varchar_2 like \"9%\""
}
],
"rerank": {
"WeightedRanker": [
0.5,
0.5
]
},
"output_fields": [
"*"
],
"ignore_growing": false,
"guarantee_timestamp": null,
"partition_names": null,
"timeout": 60,
"random_data": true
}
},
{
"type": "query",
"weight": 1,
"params": {
"ids": null,
"expr": "varchar_3 like \"a%\" && ",
"output_fields": [
"*"
],
"offset": null,
"limit": null,
"ignore_growing": false,
"partition_names": null,
"timeout": 60,
"random_data": true,
"random_count": 20,
"random_range": [
0,
150000
],
"field_name": "id",
"field_type": "int64"
}
}
]
}
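As an illustration of how the hybrid_search task configured above maps onto the SDK, a hedged pymilvus 2.4 sketch (reusing the collection handle from the earlier sketch; the query vector and the limit of the second request are placeholders):

from pymilvus import AnnSearchRequest, WeightedRanker

query_vec = [[0.1, 0.2, 0.3]]  # nq=1, dim=3

reqs = [
    AnnSearchRequest(data=query_vec, anns_field="float_vector",
                     param={"metric_type": "L2", "params": {"nprobe": 16}},
                     limit=2000, expr='varchar_1 like "0%"'),
    AnnSearchRequest(data=query_vec, anns_field="float_vector",
                     param={"metric_type": "L2", "params": {"nprobe": 128}},
                     limit=10, expr='varchar_2 like "9%"'),
]
results = collection.hybrid_search(reqs, WeightedRanker(0.5, 0.5),
                                   limit=10, output_fields=["*"])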
Let me explain why this issue still exists even after https://github.com/milvus-io/milvus/pull/31615 was merged.
Suppose we have two segments, seg_a and seg_b, whose vector indexes have already been built. When the user tries to load the collection, QueryCoord decides that seg_c, which was compacted from seg_a and seg_b, should be loaded. The vector index for seg_c is ready, but the scalar index is still in progress, so the scalar fields will be loaded from raw data, which makes the memory prediction too high.
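In other words, the prediction falls back to the raw column size for any field whose index is not yet ready. A hedged pseudocode illustration of that fallback (not the actual QueryCoord / segment_loader.go logic; the sizes are made up):

def predict_segment_mem_mb(fields):
    total = 0.0
    for f in fields:
        if f["index_ready"]:
            total += f["index_size_mb"]  # use the built index size
        else:
            total += f["raw_size_mb"]    # fall back to raw data size
    return total

# For the compacted seg_c: the vector index is ready, but the INVERTED indexes
# on the large varchar columns (max_length up to 65535) are still building,
# so their raw-data size dominates the prediction.
seg_c = [
    {"index_ready": True,  "index_size_mb": 5.0,  "raw_size_mb": 4.0},     # float_vector
    {"index_ready": False, "index_size_mb": None, "raw_size_mb": 9000.0},  # varchar_3
]
print(predict_segment_mem_mb(seg_c))  # ~9005 MB predicted, far more than is used once the index lands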
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.