
[Bug]: [benchmark] Some flush 30s timeout failures during concurrency testing

Open elstic opened this issue 10 months ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.4-20240417-8f7ac8f7
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

case: test_concurrent_locust_hnsw_dql_filter_insert_cluster, argo task: fouram-memory-index-stab-1713380400, id: 4

4am cluster, qa-milvus ns, server:

fouram-memory-i80400-4-86-5003-etcd-0                             1/1     Running                           0                5h17m   10.104.26.116   4am-node32   <none>           <none>
fouram-memory-i80400-4-86-5003-etcd-1                             1/1     Running                           0                5h17m   10.104.18.151   4am-node25   <none>           <none>
fouram-memory-i80400-4-86-5003-etcd-2                             1/1     Running                           0                5h17m   10.104.28.53    4am-node33   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-datacoord-548bbdccb8l7pbw   1/1     Running                           0                5h17m   10.104.9.170    4am-node14   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-datanode-5b666d79ff-jzlf9   1/1     Running                           3 (5h4m ago)     5h17m   10.104.19.81    4am-node28   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-indexcoord-fbdc465474n787   1/1     Running                           0                5h17m   10.104.23.254   4am-node27   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-indexnode-5c6b45c578b29pv   1/1     Running                           0                5h17m   10.104.32.250   4am-node39   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-proxy-84dddcc47f-82glq      1/1     Running                           3 (5h4m ago)     5h17m   10.104.24.93    4am-node29   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-querycoord-59f6d94649z5p9   1/1     Running                           3 (5h4m ago)     5h17m   10.104.24.92    4am-node29   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-querynode-549cf45db4k2cgr   1/1     Running                           0                5h17m   10.104.23.2     4am-node27   <none>           <none>
fouram-memory-i80400-4-86-5003-milvus-rootcoord-76fc756c88cv9wr   1/1     Running                           3 (5h4m ago)     5h17m   10.104.15.253   4am-node20   <none>           <none>
fouram-memory-i80400-4-86-5003-minio-0                            1/1     Running                           0                5h17m   10.104.30.187   4am-node38   <none>           <none>
fouram-memory-i80400-4-86-5003-minio-1                            1/1     Running                           0                5h17m   10.104.33.6     4am-node36   <none>           <none>
fouram-memory-i80400-4-86-5003-minio-2                            1/1     Running                           0                5h17m   10.104.26.115   4am-node32   <none>           <none>
fouram-memory-i80400-4-86-5003-minio-3                            1/1     Running                           0                5h17m   10.104.18.152   4am-node25   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-0                    1/1     Running                           0                5h17m   10.104.17.160   4am-node23   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-1                    1/1     Running                           0                5h17m   10.104.15.22    4am-node20   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-2                    1/1     Running                           0                5h17m   10.104.31.251   4am-node34   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-bookie-init-2m82t           0/1     Completed                         0                5h17m   10.104.27.39    4am-node31   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-broker-0                    1/1     Running                           0                5h17m   10.104.27.41    4am-node31   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-proxy-0                     1/1     Running                           3 (5h2m ago)     5h17m   10.104.13.26    4am-node16   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-pulsar-init-x7dhc           0/1     Completed                         0                5h17m   10.104.23.252   4am-node27   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-recovery-0                  1/1     Running                           0                5h17m   10.104.4.178    4am-node11   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-0                 1/1     Running                           0                5h17m   10.104.26.113   4am-node32   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-1                 1/1     Running                           0                5h16m   10.104.15.27    4am-node20   <none>           <none>
fouram-memory-i80400-4-86-5003-pulsar-zookeeper-2                 1/1     Running                           0                5h15m   10.104.18.175   4am-node25   <none>           <none>

grafana: image

client error log: {pod="fouram-memory-index-stab-1713380400-652192920"} |= "ERROR" image

statistics of flush execution: 32 failures, 123873 total executions.

'flush': {'Requests': 123873,
   'Fails': 32,
   'RPS': 6.88,
   'fail_s': 0.0,
   'RT_max': 58835.13,
   'RT_avg': 2821.34,
   'TP50': 3000.0,
   'TP99': 9500.0},
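
To put these numbers in proportion, here is a small sketch that derives the failure rate and cross-checks the request count against the reported RPS. It assumes the concurrent stage ran for the 5 hours given in the reproduction steps, that the RT_* values are in milliseconds, and that the timeout is the 30 s named in the title; none of these units are stated explicitly in the report.

```python
# Figures copied from the flush statistics in this report.
flush = {"Requests": 123873, "Fails": 32, "RPS": 6.88, "RT_max": 58835.13}

# Share of flush calls that failed.
fail_rate = flush["Fails"] / flush["Requests"]
fail_pct = fail_rate * 100

# Assumption: the concurrent stage lasted 5 hours.
duration_s = 5 * 3600
expected_requests = flush["RPS"] * duration_s

# Assumption: RT_* values are milliseconds and the flush timeout is 30 s.
timeout_ms = 30_000
max_exceeds_timeout = flush["RT_max"] > timeout_ms

print(f"failure rate: {fail_pct:.4f}%")
print(f"requests implied by RPS over 5h: {expected_requests:.0f}")
print(f"RT_max above 30 s timeout: {max_exceeds_timeout}")
```

Under those assumptions the failures are a small fraction of all flush calls, and the worst-case latency is well above the timeout while TP99 stays below it.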

Expected Behavior

No response

Steps To Reproduce

1. create a collection or use an existing collection
2. build an HNSW index on the vector column
3. insert 100k vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. execute concurrent search, query, flush, insert
9. step 8 lasts 5h
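
The concurrent stage (step 8) can be approximated with a minimal harness like the one below. This is only a sketch, not the benchmark's actual code: `run_concurrently` and `fake_flush` are hypothetical names, `fake_flush` is a stand-in for a real client flush call, and the harness counts calls that ran longer than the timeout after the fact rather than cancelling them.

```python
import concurrent.futures as cf
import threading
import time


def run_concurrently(op, n_calls, workers, timeout_s):
    """Call op() n_calls times on a thread pool and count slow calls."""
    def timed_call(_):
        start = time.perf_counter()
        op()
        return time.perf_counter() - start

    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed_call, range(n_calls)))

    # A call counts as failed when it took longer than the timeout.
    fails = sum(1 for t in latencies if t > timeout_s)
    return {
        "Requests": n_calls,
        "Fails": fails,
        "RT_max": max(latencies) * 1000,  # milliseconds
    }


# Hypothetical stand-in for a real flush: every 5th call is slow.
_counter = 0
_lock = threading.Lock()


def fake_flush():
    global _counter
    with _lock:
        _counter += 1
        n = _counter
    if n % 5 == 0:
        time.sleep(0.2)


stats = run_concurrently(fake_flush, n_calls=20, workers=4, timeout_s=0.1)
print(stats)
```

To run it against a real deployment, `fake_flush` would be replaced by a function that calls the SDK's flush on the collection, with `timeout_s=30`.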

Milvus Log

No response

Anything else?

No response

elstic • Apr 18 '24 02:04

/assign

XuanYang-cn • Apr 18 '24 03:04

/unassign

yanliang567 • Apr 18 '24 07:04

Flush seems to be a very frequent activity. I really think we should add a write cache on local SSD on the datanode to avoid frequent flushes.

xiaofan-luan • Apr 18 '24 18:04

This issue has not come up recently.

elstic • Aug 22 '24 02:08