[Bug]: [benchmark][query] milvus concurrent query returning vectors: querynode restarts, query fails and reports "fail to query on all shard leaders"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:master-20230606-ea629228
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
The client side only performs concurrent queries.
argo task : fouramf-concurrent-h9xqz
06:11:00 server(before):
fouram-40-4340-etcd-0 1/1 Running 0 6m 10.104.17.239 4am-node23 <none> <none>
fouram-40-4340-etcd-1 1/1 Running 0 5m59s 10.104.21.153 4am-node24 <none> <none>
fouram-40-4340-etcd-2 1/1 Running 0 5m59s 10.104.22.82 4am-node26 <none> <none>
fouram-40-4340-milvus-datacoord-6fc7cd4bf6-s6qx4 1/1 Running 1 (119s ago) 6m1s 10.104.21.133 4am-node24 <none> <none>
fouram-40-4340-milvus-datanode-5895fdf7f6-nzzvg 1/1 Running 1 (2m1s ago) 6m1s 10.104.19.46 4am-node28 <none> <none>
fouram-40-4340-milvus-indexcoord-667bf785c6-vz6mb 1/1 Running 0 6m1s 10.104.18.125 4am-node25 <none> <none>
fouram-40-4340-milvus-indexnode-799d598f59-7j52g 1/1 Running 0 6m1s 10.104.18.126 4am-node25 <none> <none>
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf 1/1 Running 1 (2m ago) 6m1s 10.104.21.129 4am-node24 <none> <none>
fouram-40-4340-milvus-querycoord-577dc7f59f-t7bc7 1/1 Running 1 (2m1s ago) 6m1s 10.104.19.45 4am-node28 <none> <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq 1/1 Running 0 6m1s 10.104.23.86 4am-node27 <none> <none>
fouram-40-4340-milvus-rootcoord-5559667bb9-stpxs 1/1 Running 1 (2m1s ago) 6m1s 10.104.23.85 4am-node27 <none> <none>
fouram-40-4340-minio-0 1/1 Running 0 6m1s 10.104.21.150 4am-node24 <none> <none>
fouram-40-4340-minio-1 1/1 Running 0 6m1s 10.104.15.130 4am-node20 <none> <none>
fouram-40-4340-minio-2 1/1 Running 0 6m 10.104.17.240 4am-node23 <none> <none>
fouram-40-4340-minio-3 1/1 Running 0 5m59s 10.104.22.80 4am-node26 <none> <none>
fouram-40-4340-pulsar-bookie-0 1/1 Running 0 6m1s 10.104.21.151 4am-node24 <none> <none>
fouram-40-4340-pulsar-bookie-1 1/1 Running 0 6m 10.104.17.241 4am-node23 <none> <none>
fouram-40-4340-pulsar-bookie-2 1/1 Running 0 5m59s 10.104.20.235 4am-node22 <none> <none>
fouram-40-4340-pulsar-bookie-init-r6nff 0/1 Completed 0 6m1s 10.104.21.131 4am-node24 <none> <none>
fouram-40-4340-pulsar-broker-0 1/1 Running 0 6m 10.104.15.117 4am-node20 <none> <none>
fouram-40-4340-pulsar-proxy-0 1/1 Running 0 6m1s 10.104.15.116 4am-node20 <none> <none>
fouram-40-4340-pulsar-pulsar-init-s7cv8 0/1 Completed 0 6m1s 10.104.21.130 4am-node24 <none> <none>
fouram-40-4340-pulsar-recovery-0 1/1 Running 0 6m1s 10.104.21.132 4am-node24 <none> <none>
fouram-40-4340-pulsar-zookeeper-0 1/1 Running 0 6m1s 10.104.21.149 4am-node24 <none> <none>
fouram-40-4340-pulsar-zookeeper-1 1/1 Running 0 4m32s 10.104.17.250 4am-node23 <none> <none>
fouram-40-4340-pulsar-zookeeper-2 1/1 Running 0 3m35s 10.104.5.30 4am-node12 <none> <none>
07:12:53 server (after):
fouram-40-4340-etcd-0 1/1 Running 0 67m 10.104.17.239 4am-node23 <none> <none>
fouram-40-4340-etcd-1 1/1 Running 0 67m 10.104.21.153 4am-node24 <none> <none>
fouram-40-4340-etcd-2 1/1 Running 0 67m 10.104.22.82 4am-node26 <none> <none>
fouram-40-4340-milvus-datacoord-6fc7cd4bf6-s6qx4 1/1 Running 1 (63m ago) 67m 10.104.21.133 4am-node24 <none> <none>
fouram-40-4340-milvus-datanode-5895fdf7f6-nzzvg 1/1 Running 1 (63m ago) 67m 10.104.19.46 4am-node28 <none> <none>
fouram-40-4340-milvus-indexcoord-667bf785c6-vz6mb 1/1 Running 0 67m 10.104.18.125 4am-node25 <none> <none>
fouram-40-4340-milvus-indexnode-799d598f59-7j52g 1/1 Running 0 67m 10.104.18.126 4am-node25 <none> <none>
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf 1/1 Running 1 (63m ago) 67m 10.104.21.129 4am-node24 <none> <none>
fouram-40-4340-milvus-querycoord-577dc7f59f-t7bc7 1/1 Running 1 (63m ago) 67m 10.104.19.45 4am-node28 <none> <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq 0/1 Running 16 (5m14s ago) 67m 10.104.23.86 4am-node27 <none> <none>
fouram-40-4340-milvus-rootcoord-5559667bb9-stpxs 1/1 Running 1 (63m ago) 67m 10.104.23.85 4am-node27 <none> <none>
fouram-40-4340-minio-0 1/1 Running 0 67m 10.104.21.150 4am-node24 <none> <none>
fouram-40-4340-minio-1 1/1 Running 0 67m 10.104.15.130 4am-node20 <none> <none>
fouram-40-4340-minio-2 1/1 Running 0 67m 10.104.17.240 4am-node23 <none> <none>
fouram-40-4340-minio-3 1/1 Running 0 67m 10.104.22.80 4am-node26 <none> <none>
fouram-40-4340-pulsar-bookie-0 1/1 Running 0 67m 10.104.21.151 4am-node24 <none> <none>
fouram-40-4340-pulsar-bookie-1 1/1 Running 0 67m 10.104.17.241 4am-node23 <none> <none>
fouram-40-4340-pulsar-bookie-2 1/1 Running 0 67m 10.104.20.235 4am-node22 <none> <none>
fouram-40-4340-pulsar-bookie-init-r6nff 0/1 Completed 0 67m 10.104.21.131 4am-node24 <none> <none>
fouram-40-4340-pulsar-broker-0 1/1 Running 0 67m 10.104.15.117 4am-node20 <none> <none>
fouram-40-4340-pulsar-proxy-0 1/1 Running 0 67m 10.104.15.116 4am-node20 <none> <none>
fouram-40-4340-pulsar-pulsar-init-s7cv8 0/1 Completed 0 67m 10.104.21.130 4am-node24 <none> <none>
fouram-40-4340-pulsar-recovery-0 1/1 Running 0 67m 10.104.21.132 4am-node24 <none> <none>
fouram-40-4340-pulsar-zookeeper-0 1/1 Running 0 67m 10.104.21.149 4am-node24 <none> <none>
fouram-40-4340-pulsar-zookeeper-1 1/1 Running 0 66m 10.104.17.250 4am-node23 <none> <none>
fouram-40-4340-pulsar-zookeeper-2 1/1 Running 0 65m 10.104.5.30 4am-node12 <none> <none>
client pod: fouramf-concurrent-h9xqz-997902271
client error log:
param:
{
"dataset_params": {
"metric_type": "L2",
"dim": 128,
"dataset_name": "sift",
"dataset_size": 1000000,
"ni_per": 50000
},
"collection_params": {
"other_fields": [
"float_1"
],
"shards_num": 2
},
"load_params": {},
"query_params": {},
"search_params": {},
"resource_groups_params": {
"reset": false
},
"index_params": {
"index_type": "HNSW",
"index_param": {
"M": 8,
"efConstruction": 200
}
},
"concurrent_params": {
"concurrent_number": 20,
"during_time": "1h",
"interval": 20,
"spawn_rate": null
},
"concurrent_tasks": [
{
"type": "query",
"weight": 10,
"params": {
"expr": {
"float_1": {
"GT": -1,
"LT": 500000
}
},
"output_fields": [
"float_vector"
],
"ignore_growing": false,
"timeout": 60
}
}
]
}
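For reference, a minimal pymilvus sketch of the concurrent query task described by the params above (the connection endpoint, collection name, and the thread-pool driver are assumptions; the fouram benchmark framework manages these itself):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from pymilvus import connections, Collection

# Assumed endpoint and collection name (not taken from the report).
connections.connect(host="127.0.0.1", port="19530")
collection = Collection("fouram_sift_1m")

# Equivalent of the task expr {"float_1": {"GT": -1, "LT": 500000}}
EXPR = "float_1 > -1 && float_1 < 500000"

def query_once():
    # Returning the raw float_vector field makes each result set large,
    # which is what puts memory pressure on the querynode.
    return collection.query(
        expr=EXPR,
        output_fields=["float_vector"],
        timeout=60,
    )

def run(duration_s=3600, workers=20):
    # Rough equivalent of concurrent_number=20, during_time=1h.
    deadline = time.time() + duration_s
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while time.time() < deadline:
            futures = [pool.submit(query_once) for _ in range(workers)]
            for f in futures:
                # Raises once the querynode is down:
                # "fail to query on all shard leaders"
                f.result()

if __name__ == "__main__":
    run()
```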
querynode metrics were not collected
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
DISKANN indexes with concurrent queries have the same problem.
/assign @jiaoew1991 /unassign
Can this issue be reproduced?
@jiaoew1991 Yes, it can still be reproduced; verified again using the image master-20230614-35cb0b5b.
@jiaoew1991 This log is confusing but does not cause any problems, so congqi has fixed it in #24741. The root cause still needs to be investigated; please help with it.
/assign @sunby /unassign
The querynode pod was OOM-killed, caused by the concurrent query requests.
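One way to confirm the OOM kill from the client side is to read the querynode pod's last terminated state (a sketch using the official kubernetes Python client; the namespace is an assumption, the pod-name prefix is taken from the listing above):

```python
from kubernetes import client, config

config.load_kube_config()          # assumes a reachable kubeconfig
v1 = client.CoreV1Api()

NAMESPACE = "qa-milvus"            # assumption: namespace the fouram pods run in
POD_PREFIX = "fouram-40-4340-milvus-querynode"

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if not pod.metadata.name.startswith(POD_PREFIX):
        continue
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term is not None:
            # Expect reason == "OOMKilled" with restart_count climbing
            # (16 restarts in the pod listing above).
            print(pod.metadata.name, cs.restart_count, term.reason, term.finished_at)
```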
/assign @elstic /unassign
@elstic how much memory was allocated for this test?
No more OOM, but concurrent queries that return vectors still fail with an error.
keep an eye on this issue: https://github.com/milvus-io/milvus/issues/25996