[Bug]: [benchmark] Some flush 30s timeout failures during concurrency testing
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.4-20240417-8f7ac8f7
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
case: test_concurrent_locust_hnsw_dql_filter_insert_cluster
argo task: fouram-memory-index-stab-1713380400, id: 4
4am cluster, qa-milvus namespace, server pods:
fouram-memory-i80400-4-86-5003-etcd-0 1/1 Running 0 5h17m 10.104.26.116 4am-node32 <none> <none>
fouram-memory-i80400-4-86-5003-etcd-1 1/1 Running 0 5h17m 10.104.18.151 4am-node25 <none> <none>
fouram-memory-i80400-4-86-5003-etcd-2 1/1 Running 0 5h17m 10.104.28.53 4am-node33 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-datacoord-548bbdccb8l7pbw 1/1 Running 0 5h17m 10.104.9.170 4am-node14 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-datanode-5b666d79ff-jzlf9 1/1 Running 3 (5h4m ago) 5h17m 10.104.19.81 4am-node28 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-indexcoord-fbdc465474n787 1/1 Running 0 5h17m 10.104.23.254 4am-node27 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-indexnode-5c6b45c578b29pv 1/1 Running 0 5h17m 10.104.32.250 4am-node39 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-proxy-84dddcc47f-82glq 1/1 Running 3 (5h4m ago) 5h17m 10.104.24.93 4am-node29 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-querycoord-59f6d94649z5p9 1/1 Running 3 (5h4m ago) 5h17m 10.104.24.92 4am-node29 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-querynode-549cf45db4k2cgr 1/1 Running 0 5h17m 10.104.23.2 4am-node27 <none> <none>
fouram-memory-i80400-4-86-5003-milvus-rootcoord-76fc756c88cv9wr 1/1 Running 3 (5h4m ago) 5h17m 10.104.15.253 4am-node20 <none> <none>
fouram-memory-i80400-4-86-5003-minio-0 1/1 Running 0 5h17m 10.104.30.187 4am-node38 <none> <none>
fouram-memory-i80400-4-86-5003-minio-1 1/1 Running 0 5h17m 10.104.33.6 4am-node36 <none> <none>
fouram-memory-i80400-4-86-5003-minio-2 1/1 Running 0 5h17m 10.104.26.115 4am-node32 <none> <none>
fouram-memory-i80400-4-86-5003-minio-3 1/1 Running 0 5h17m 10.104.18.152 4am-node25 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-0 1/1 Running 0 5h17m 10.104.17.160 4am-node23 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-1 1/1 Running 0 5h17m 10.104.15.22 4am-node20 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-2 1/1 Running 0 5h17m 10.104.31.251 4am-node34 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-init-2m82t 0/1 Completed 0 5h17m 10.104.27.39 4am-node31 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-broker-0 1/1 Running 0 5h17m 10.104.27.41 4am-node31 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-proxy-0 1/1 Running 3 (5h2m ago) 5h17m 10.104.13.26 4am-node16 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-pulsar-init-x7dhc 0/1 Completed 0 5h17m 10.104.23.252 4am-node27 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-recovery-0 1/1 Running 0 5h17m 10.104.4.178 4am-node11 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-0 1/1 Running 0 5h17m 10.104.26.113 4am-node32 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-1 1/1 Running 0 5h16m 10.104.15.27 4am-node20 <none> <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-2 1/1 Running 0 5h15m 10.104.18.175 4am-node25 <none> <none>
grafana:
client error log:
{pod="fouram-memory-index-stab-1713380400-652192920"} |= "ERROR"
Statistics of flush execution: 32 failures out of 123,873 total executions:
'flush': {'Requests': 123873,
'Fails': 32,
'RPS': 6.88,
'fail_s': 0.0,
'RT_max': 58835.13,
'RT_avg': 2821.34,
'TP50': 3000.0,
'TP99': 9500.0},
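The 30 s in the title is the client-side flush timeout. Below is a minimal sketch of how such a timeout is counted as a failure, assuming the client calls Collection.flush() with timeout=30 via pymilvus; the helper function and stats dict are illustrative, not the actual fouram client code:

```python
import time
from pymilvus import connections, Collection
from pymilvus.exceptions import MilvusException

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("test_concurrent_locust_hnsw")   # assumed existing collection

stats = {"Requests": 0, "Fails": 0, "latencies_ms": []}

def timed_flush(timeout_s: float = 30.0) -> None:
    """Issue one flush with a 30 s timeout and record success/failure."""
    stats["Requests"] += 1
    start = time.perf_counter()
    try:
        # pymilvus raises MilvusException when the flush does not finish
        # within the given timeout, which is counted as a failure here.
        collection.flush(timeout=timeout_s)
    except MilvusException:
        stats["Fails"] += 1
    finally:
        stats["latencies_ms"].append((time.perf_counter() - start) * 1000)
```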
Expected Behavior
No response
Steps To Reproduce
1. create a collection or use an existing collection
2. build an HNSW index on the vector column
3. insert 100k vectors
4. flush the collection
5. build an index on the vector column with the same parameters
6. count the total number of rows
7. load the collection
8. execute concurrent search, query, flush, and insert requests
9. run step 8 for 5 hours (a rough pymilvus sketch of steps 1-8 follows below)
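For reference, a minimal pymilvus sketch of steps 1-8. The collection name, dimension, and HNSW parameters below are illustrative assumptions (the fouram case may use different values), and step 8 is driven concurrently by locust workers in the real test rather than run sequentially:

```python
import numpy as np
from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128
name = "test_concurrent_locust_hnsw"          # illustrative name
index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}

# step 1: create a collection or reuse an existing one
if not utility.has_collection(name):
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
    ])
    Collection(name, schema)
collection = Collection(name)

# step 2: build an HNSW index on the vector column
collection.create_index("float_vector", index_params)

# step 3: insert 100k vectors in batches
vectors = np.random.random((100_000, dim)).astype(np.float32)
for i in range(0, len(vectors), 10_000):
    collection.insert([vectors[i:i + 10_000].tolist()])

collection.flush()                                      # step 4: flush the collection
collection.create_index("float_vector", index_params)  # step 5: same index parameters
print("row count:", collection.num_entities)            # step 6: count rows
collection.load()                                       # step 7: load the collection

# step 8: one iteration of each request type; the real test issues these
# concurrently for 5 hours
collection.search(vectors[:1].tolist(), "float_vector",
                  {"metric_type": "L2", "params": {"ef": 64}},
                  limit=10, expr="id >= 0")
collection.query(expr="id >= 0", limit=10)
collection.insert([np.random.random((100, dim)).astype(np.float32).tolist()])
collection.flush()
```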
Milvus Log
No response
Anything else?
No response
/assign
/unassign
Flush appears to be a very frequent operation here. I really think we should add a write cache on local SSD on the datanode to avoid such frequent flushes.
This issue has not come up recently.