[Bug]: [benchmark] Some flush 600s timeout failures during concurrency testing
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.3-20240103-6bf46c6f
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
After inserting 100,000 vectors, the concurrent search, query, insert, and flush workload continues; about 0.5% of the flush requests fail due to timeouts.
case: test_concurrent_locust_hnsw_dql_filter_insert_standalone, argo task: fouram-memory-index-stab-1704308400, id: 3
server:
fouram-memory-i08400-3-20-4598-etcd-0 1/1 Running 0 5h7m 10.104.18.126 4am-node25 <none> <none>
fouram-memory-i08400-3-20-4598-milvus-standalone-7d9f74c99z62zd 1/1 Running 1 (5h2m ago) 5h7m 10.104.16.77 4am-node21 <none> <none>
fouram-memory-i08400-3-20-4598-minio-786874cdb9-f2dv2 1/1 Running 0 5h7m 10.104.16.78 4am-node21 <none> <none> (cli_client.py:132)
client error log:
{ pod=~"fouram-memory-index-stab-1704308400-4111680801"} |= "ERROR"
client result (flush part):
'flush': {'Requests': 95815,
'Fails': 499,
'RPS': 5.32,
'fail_s': 0.01,
'RT_max': 145886.97,
'RT_avg': 3496.96,
'TP50': 2700.0,
'TP99': 30000.0},
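As a quick sanity check on the "about 0.5%" figure, the failure ratio follows directly from the counters above (plain Python, using only numbers from this report):

```python
# Flush failure ratio from the client summary above.
requests, fails = 95815, 499
print(f"flush fail ratio: {fails / requests:.2%}")  # -> 0.52%, i.e. "about 0.5%"
```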
Expected Behavior
No flush fail
Steps To Reproduce
1. create a collection or use an existing collection
2. build an HNSW index on the vector column
3. insert 100k vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. execute concurrent search, query, flush, and insert requests (a client-side sketch of these steps follows this list)
9. run step 8 for 5h
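A rough client-side sketch of steps 1-7 above (pymilvus, standalone deployment). The collection name, field names, dim, and index build params here are assumptions, not the benchmark's actual values:

```python
# Hedged sketch of the reproduction steps, not the exact fouram benchmark code.
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128  # assumed
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("fouram_hnsw_stab", CollectionSchema(fields))   # step 1

index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}               # assumed params
collection.create_index("float_vector", index_params)                   # step 2

nb = 100_000                                                             # step 3
collection.insert([list(range(nb)),
                   [[random.random() for _ in range(dim)] for _ in range(nb)]])

collection.flush(timeout=600)                                            # step 4
collection.create_index("float_vector", index_params)                    # step 5 (same params)
print(collection.num_entities)                                           # step 6
collection.load()                                                        # step 7
# Steps 8-9: the benchmark then drives concurrent search/query/insert/flush
# for 5h through a Locust-style runner rather than inline calls.
```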
Milvus Log
No response
Anything else?
No response
/assign @congqixia /unassign
same issue: flush 600s timeout on image 2.3-20240108-8c2ca3fb-amd64
- argo: stable-insert-write-1
- grafana: grafana metrics
- test steps:
  - create collection with a partition_key field int64_pk_5b and 128 num_partitions (see the schema sketch after this list)
  - create a DISKANN index
  - insert 10m-768d data -> flush -> index again -> load
  - concurrent insert + delete + flush + query + search -> 26% of 914 flush requests timed out at 600s
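For reference, a hedged sketch of the partition-key setup described above: the 768d vector field, the int64_pk_5b partition-key field, and 128 partitions come from this report, the collection name laion_coll_4 from the error log below, and everything else is assumed:

```python
# Hedged sketch of the partition-key collection + DISKANN index setup.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("int64_pk_5b", DataType.INT64, is_partition_key=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("laion_coll_4", CollectionSchema(fields),
                        num_partitions=128)

# DISKANN takes no extra build params; metric type L2 is an assumption.
collection.create_index("float_vector",
                        {"index_type": "DISKANN", "metric_type": "L2"})
```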
[2024-01-09 05:09:49,785 - WARNING - fouram]: Retry timeout: 600s (decorators.py:94)
[2024-01-09 05:09:49,785 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)>, <Time:{'RPC start': '2024-01-09 04:58:32.041334', 'RPC error': '2024-01-09 05:09:49.785849'}> (decorators.py:128)
[2024-01-09 05:09:49,787 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)> (api_request.py:54)
[2024-01-09 05:09:49,787 - ERROR - fouram]: [CheckFunc] flush request check failed, response:<MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)> (func_check.py:48)
'concurrent_params': {'concurrent_number': 100,
                      'during_time': '10h',
                      'interval': 120,
                      'spawn_rate': None},
'concurrent_tasks': [{'type': 'query',
                      'weight': 2,
                      'params': {'expr': 'id > 5000000', 'timeout': 600}},
                     {'type': 'search',
                      'weight': 3,
                      'params': {'nq': 100,
                                 'top_k': 100,
                                 'search_param': {'search_list': 100},
                                 'timeout': 600}},
                     {'type': 'insert',
                      'weight': 2,
                      'params': {'nb': 200,
                                 'start_id': 5000000,
                                 'random_id': True,
                                 'random_vector': True,
                                 'timeout': 600}},
                     {'type': 'delete',
                      'weight': 2,
                      'params': {'delete_length': 100, 'timeout': 600}},
                     {'type': 'flush',
                      'weight': 1,
                      'params': {'timeout': 600}}]},
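For context on what this task mix means for the flush load: in a Locust-style runner each concurrent user repeatedly picks one task with probability proportional to its weight, so with the weights above roughly one request in ten is a flush. A simplified illustration (not the fouram client's actual code):

```python
# Simplified weighted task selection, mimicking Locust weight semantics.
import random

tasks = {"query": 2, "search": 3, "insert": 2, "delete": 2, "flush": 1}

def pick_task() -> str:
    names, weights = zip(*tasks.items())
    return random.choices(names, weights=weights, k=1)[0]

# Rough expected share of flush requests in the overall mix:
print(tasks["flush"] / sum(tasks.values()))  # 0.1
```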
- 4am pods:
stable-insert-write-milvus-datanode-769ffd74c-f4fxh Running 0 1m 10.104.13.29 4am-node16
stable-insert-write-milvus-datanode-769ffd74c-wjdlv Running 0 1m 10.104.31.247 4am-node34
stable-insert-write-milvus-indexnode-5b75cbd6bb-9cndv Running 0 1m 10.104.33.13 4am-node36
stable-insert-write-milvus-indexnode-5b75cbd6bb-fnflx Running 0 1m 10.104.25.70 4am-node30
stable-insert-write-milvus-mixcoord-7bfc844ff8-gb5xq Running 0 1m 10.104.26.42 4am-node32
stable-insert-write-milvus-proxy-5f9b6bd-fp747 Running 0 1m 10.104.19.219 4am-node28
stable-insert-write-milvus-querynode-7f797f97f7-7jwq8 Running 0 1m 10.104.32.82 4am-node39
stable-insert-write-milvus-querynode-7f797f97f7-7vl7w Running 0 1m 10.104.31.248 4am-node34
stable-insert-write-milvus-querynode-7f797f97f7-rrnqk Running 0 1m 10.104.24.208 4am-node29
Some flushes fail during concurrency testing
case: test_concurrent_locust_diskann_compaction_cluster, image: 2.3-20240115-f2a0c3f0-amd64
test steps:
1. create a collection or use an existing collection
2. build a DiskANN index on the vector column
3. insert 100k vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. execute concurrent search, query, flush, insert, delete, and load requests (see the workload sketch after this list)
9. run step 8 for 5h
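A hedged sketch of what the step 8 workload looks like from the client side: a handful of threads each looping over a random mix of the six operations with a 600s timeout. Collection name, field name, dim, metric type, and expressions are placeholders, and the real benchmark drives this through the fouram/Locust runner rather than raw threads:

```python
# Hedged sketch of the concurrent workload in step 8; not the benchmark's code.
import random
import threading
import time

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("fouram_diskann_stab")  # assumed existing collection
dim = 128                                       # assumed vector dim

def one_op() -> None:
    op = random.choice(["search", "query", "flush", "insert", "delete", "load"])
    if op == "search":
        collection.search(
            data=[[random.random() for _ in range(dim)]],
            anns_field="float_vector",
            param={"metric_type": "L2", "params": {"search_list": 100}},
            limit=100,
            timeout=600,
        )
    elif op == "query":
        collection.query(expr="id > 0", limit=10, timeout=600)
    elif op == "flush":
        collection.flush(timeout=600)
    elif op == "insert":
        pk = random.randint(0, 10_000_000)
        collection.insert([[pk], [[random.random() for _ in range(dim)]]],
                          timeout=600)
    elif op == "delete":
        collection.delete(expr=f"id in [{random.randint(0, 100_000)}]",
                          timeout=600)
    else:
        collection.load(timeout=600)

def worker(stop_at: float) -> None:
    while time.time() < stop_at:
        one_op()

stop_at = time.time() + 5 * 3600  # step 9: keep the mix running for 5h
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```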
client pod: fouram-disk-stab-1705352400-1063076031; client error log:
[2024-01-15 21:43:53,572 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)>, <Time:{'RPC start': '2024-01-15 21:43:53.561926', 'RPC error': '2024-01-15 21:43:53.572594'}> (decorators.py:134)
[2024-01-15 21:43:53,574 - ERROR - fouram]: (api_response) : <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)> (api_request.py:54)
[2024-01-15 21:43:53,574 - ERROR - fouram]: [CheckFunc] flush request check failed, response:<MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)> (func_check.py:54)
[2024-01-15 21:43:53,574 - ERROR - fouram]: [func_time_catch] : (api_request.py:121)
[2024-01-15 21:43:53,866 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
server:
fouram-disk-sta52400-3-85-5049-etcd-0 1/1 Running 0 5h7m 10.104.31.66 4am-node34 <none> <none>
fouram-disk-sta52400-3-85-5049-etcd-1 1/1 Running 0 5h7m 10.104.26.177 4am-node32 <none> <none>
fouram-disk-sta52400-3-85-5049-etcd-2 1/1 Running 0 5h7m 10.104.32.33 4am-node39 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-datacoord-5b57f7b68fr9rjl 1/1 Running 0 5h7m 10.104.32.24 4am-node39 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-datanode-57b88768cb-kvf57 0/1 Completed 1 (5h3m ago) 5h7m 10.104.23.88 4am-node27 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-datanode-57b88768cb-xslfq 1/1 Running 0 4h26m 10.104.18.196 4am-node25 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-indexcoord-c5cbbc7d8skzzz 1/1 Running 0 5h7m 10.104.31.58 4am-node34 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-indexnode-5b5956c49fm5jth 1/1 Running 0 5h7m 10.104.19.81 4am-node28 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-proxy-678db75584-jfmvn 1/1 Running 1 (5h3m ago) 5h7m 10.104.32.25 4am-node39 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-querycoord-bb545d46fzg6vd 1/1 Running 1 (5h3m ago) 5h7m 10.104.32.23 4am-node39 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-querynode-6787887c975v5bk 1/1 Running 0 5h7m 10.104.16.201 4am-node21 <none> <none>
fouram-disk-sta52400-3-85-5049-milvus-rootcoord-55cf8d85fc2tmhz 1/1 Running 1 (5h3m ago) 5h7m 10.104.16.199 4am-node21 <none> <none>
fouram-disk-sta52400-3-85-5049-minio-0 1/1 Running 0 5h7m 10.104.31.69 4am-node34 <none> <none>
fouram-disk-sta52400-3-85-5049-minio-1 1/1 Running 0 5h7m 10.104.24.229 4am-node29 <none> <none>
fouram-disk-sta52400-3-85-5049-minio-2 1/1 Running 0 5h7m 10.104.20.25 4am-node22 <none> <none>
fouram-disk-sta52400-3-85-5049-minio-3 1/1 Running 0 5h7m 10.104.25.165 4am-node30 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-0 1/1 Running 0 5h7m 10.104.26.176 4am-node32 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-1 1/1 Running 0 5h7m 10.104.32.34 4am-node39 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-2 1/1 Running 0 5h7m 10.104.24.232 4am-node29 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-init-qfmnk 0/1 Completed 0 5h7m 10.104.5.212 4am-node12 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-broker-0 1/1 Running 0 5h7m 10.104.5.213 4am-node12 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-proxy-0 1/1 Running 0 5h7m 10.104.9.29 4am-node14 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-pulsar-init-cdhjw 0/1 Completed 0 5h7m 10.104.1.9 4am-node10 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-recovery-0 1/1 Running 0 5h7m 10.104.12.5 4am-node17 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-0 1/1 Running 0 5h7m 10.104.24.228 4am-node29 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-1 1/1 Running 0 5h6m 10.104.19.91 4am-node28 <none> <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-2 1/1 Running 0 5h5m 10.104.29.159 4am-node35 <none> <none> (cli_client.py:132)
/assign @sunby
There seems to be some bug when concurrent flushes happen.
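To pin that down, it may help to reduce the workload to the suspected trigger: several threads flushing the same collection while inserts keep arriving. A minimal sketch of that scenario, assuming an existing collection with an int64 id primary key and a 128d float vector field (names and dim are placeholders):

```python
# Minimal concurrent-flush sketch; assumes an existing collection with fields
# (id: INT64 primary key, float_vector: FLOAT_VECTOR of dim 128).
import random
import threading

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("concurrent_flush_check")  # placeholder name
dim = 128

def flusher() -> None:
    for _ in range(200):
        collection.flush(timeout=600)

def inserter() -> None:
    for i in range(200):
        collection.insert([[i], [[random.random() for _ in range(dim)]]])

threads = [threading.Thread(target=flusher) for _ in range(4)]
threads.append(threading.Thread(target=inserter))
for t in threads:
    t.start()
for t in threads:
    t.join()
```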
/assign @xiaocai2333
/assign @elstic This issue may be fixed. Please verify it with the newest image.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.