
[Bug]: [benchmark] Some flush 600s timeout failures during concurrency testing

Open elstic opened this issue 1 year ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.3-20240103-6bf46c6f
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After inserting 100,000 vectors, concurrent search, query, insert, and flush requests run continuously; about 0.5% of the flushes fail due to timeouts.

case: test_concurrent_locust_hnsw_dql_filter_insert_standalone, argo task: fouram-memory-index-stab-1704308400, id: 3

server:

fouram-memory-i08400-3-20-4598-etcd-0                             1/1     Running                  0                  5h7m    10.104.18.126   4am-node25   <none>           <none>
fouram-memory-i08400-3-20-4598-milvus-standalone-7d9f74c99z62zd   1/1     Running                  1 (5h2m ago)       5h7m    10.104.16.77    4am-node21   <none>           <none>
fouram-memory-i08400-3-20-4598-minio-786874cdb9-f2dv2             1/1     Running                  0                  5h7m    10.104.16.78    4am-node21   <none>           <none> (cli_client.py:132)

client error log: { pod=~"fouram-memory-index-stab-1704308400-4111680801"} |= "ERROR"

client result (flush part):

 'flush': {'Requests': 95815,
     'Fails': 499,
     'RPS': 5.32,
     'fail_s': 0.01,
     'RT_max': 145886.97,
     'RT_avg': 3496.96,
     'TP50': 2700.0,
     'TP99': 30000.0},
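
For reference, 499 failed flushes out of 95,815 requests ≈ 0.52%, consistent with the ~0.5% failure rate described above.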

Expected Behavior

No flush failures.

Steps To Reproduce

1. create a collection or use an existing collection
2. build an HNSW index on the vector column
3. insert 100k vectors
4. flush the collection
5. build an index on the vector column with the same parameters
6. count the total number of rows
7. load the collection
8. execute concurrent search, query, flush, and insert requests (a pymilvus sketch of steps 1-8 follows)
9. step 8 lasts 5h
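
For reference, a minimal pymilvus sketch of steps 1-8, assuming pymilvus 2.3.x and an L2 metric; the collection name, field names, and HNSW parameters are illustrative, not the exact benchmark configuration:

```python
import numpy as np
from pymilvus import (Collection, CollectionSchema, FieldSchema, DataType,
                      connections)

connections.connect(host="127.0.0.1", port="19530")

dim = 128  # illustrative vector dimension
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=dim),
])
coll = Collection("fouram_repro", schema)                # step 1

index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
coll.create_index("vector", index_params)                # step 2

# step 3: insert 100k vectors (the real test inserts in batches)
vectors = np.random.random((100_000, dim)).tolist()
coll.insert([list(range(100_000)), vectors])
coll.flush()                                             # step 4
coll.create_index("vector", index_params)                # step 5 (same params)
print(coll.num_entities)                                 # step 6
coll.load()                                              # step 7
# step 8: run search/query/flush/insert concurrently for 5h (harness-driven)
```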

Milvus Log

No response

Anything else?

No response

elstic avatar Jan 04 '24 03:01 elstic

/assign @congqixia /unassign

yanliang567 avatar Jan 04 '24 11:01 yanliang567

same issue: flush 600s timeout on image 2.3-20240108-8c2ca3fb-amd64

  1. create a collection with a partition_key field int64_pk_5b and 128 num_partitions (see the sketch below)
  2. create a DISKANN index
  3. insert 10m-768d data -> flush -> index again -> load
  4. concurrent insert + delete + flush + query + search -> 26% of 914 flush requests timed out at 600s
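
A minimal sketch of step 1, assuming pymilvus 2.3.x; the partition-key field name, partition count, collection name, and 768-d dimension come from this report, while the other fields are illustrative:

```python
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    # partition-key field and partition count taken from the report above
    FieldSchema("int64_pk_5b", DataType.INT64, is_partition_key=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=768),
])
coll = Collection("laion_coll_4", schema, num_partitions=128)
```

Client error log: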
[2024-01-09 05:09:49,785 - WARNING - fouram]: Retry timeout: 600s (decorators.py:94)
[2024-01-09 05:09:49,785 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)>, <Time:{'RPC start': '2024-01-09 04:58:32.041334', 'RPC error': '2024-01-09 05:09:49.785849'}> (decorators.py:128)
[2024-01-09 05:09:49,787 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)> (api_request.py:54)
[2024-01-09 05:09:49,787 - ERROR - fouram]: [CheckFunc] flush request check failed, response:<MilvusException: (code=1, message=Retry timeout: 600s, message=wait for flush timeout, collection: laion_coll_4)> (func_check.py:48)
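
How the 600s client-side timeout surfaces, as a minimal sketch assuming pymilvus 2.3.x; the collection name is taken from the log above:

```python
from pymilvus import Collection, MilvusException

coll = Collection("laion_coll_4")
try:
    # pymilvus waits for the collection's segments to be flushed and
    # retries internally until the configured 600s timeout is exhausted
    coll.flush(timeout=600)
except MilvusException as e:
    # e.message carries "Retry timeout: 600s ... wait for flush timeout"
    print(f"flush failed: code={e.code}, message={e.message}")
```

The concurrent workload configuration was: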
'concurrent_params': {'concurrent_number': 100,
                      'during_time': '10h',
                      'interval': 120,
                      'spawn_rate': None},
'concurrent_tasks': [{'type': 'query',
                      'weight': 2,
                      'params': {'expr': 'id > 5000000', 'timeout': 600}},
                     {'type': 'search',
                      'weight': 3,
                      'params': {'nq': 100,
                                 'top_k': 100,
                                 'search_param': {'search_list': 100},
                                 'timeout': 600}},
                     {'type': 'insert',
                      'weight': 2,
                      'params': {'nb': 200,
                                 'start_id': 5000000,
                                 'random_id': True,
                                 'random_vector': True,
                                 'timeout': 600}},
                     {'type': 'delete',
                      'weight': 2,
                      'params': {'delete_length': 100, 'timeout': 600}},
                     {'type': 'flush',
                      'weight': 1,
                      'params': {'timeout': 600}}]},
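
For illustration only, a plain-threads approximation of this weighted task mix; the real test uses the fouram/Locust harness, so every name below is hypothetical, and the task callables are assumed to wrap pymilvus requests with timeout=600:

```python
import random
import threading
import time

def worker(tasks, weights, stop):
    # each worker repeatedly picks a task according to its weight
    while not stop.is_set():
        task = random.choices(tasks, weights=weights, k=1)[0]
        task()

def run(tasks, weights, concurrent_number=100, during_time=10 * 3600):
    stop = threading.Event()
    threads = [threading.Thread(target=worker, args=(tasks, weights, stop))
               for _ in range(concurrent_number)]
    for t in threads:
        t.start()
    time.sleep(during_time)  # 'during_time': '10h'
    stop.set()
    for t in threads:
        t.join()

# e.g. run([do_query, do_search, do_insert, do_delete, do_flush],
#          weights=[2, 3, 2, 2, 1])
```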
  • 4am pods:
stable-insert-write-milvus-datanode-769ffd74c-f4fxh       Running     0            1m      10.104.13.29      4am-node16     
stable-insert-write-milvus-datanode-769ffd74c-wjdlv       Running     0            1m      10.104.31.247     4am-node34     
stable-insert-write-milvus-indexnode-5b75cbd6bb-9cndv     Running     0            1m      10.104.33.13      4am-node36     
stable-insert-write-milvus-indexnode-5b75cbd6bb-fnflx     Running     0            1m      10.104.25.70      4am-node30     
stable-insert-write-milvus-mixcoord-7bfc844ff8-gb5xq      Running     0            1m      10.104.26.42      4am-node32     
stable-insert-write-milvus-proxy-5f9b6bd-fp747            Running     0            1m      10.104.19.219     4am-node28     
stable-insert-write-milvus-querynode-7f797f97f7-7jwq8     Running     0            1m      10.104.32.82      4am-node39     
stable-insert-write-milvus-querynode-7f797f97f7-7vl7w     Running     0            1m      10.104.31.248     4am-node34     
stable-insert-write-milvus-querynode-7f797f97f7-rrnqk     Running     0            1m      10.104.24.208     4am-node29 

ThreadDao avatar Jan 10 '24 08:01 ThreadDao

Some flushes fail during concurrency testing

case: test_concurrent_locust_diskann_compaction_cluster, image: 2.3-20240115-f2a0c3f0-amd64

test steps:

1. create a collection or use an existing collection
2. build a DiskANN index on the vector column
3. insert 100k vectors
4. flush the collection
5. build an index on the vector column with the same parameters (see the parameter sketch below)
6. count the total number of rows
7. load the collection
8. execute concurrent search, query, flush, insert, delete, load
9. step 8 lasts 5h
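
A hedged sketch of the DiskANN index and search calls implied by steps 2, 5, and 8, assuming pymilvus 2.3.x; `coll`, `query_vectors`, the field name, and the search_list value are illustrative:

```python
# DISKANN takes no build-time params in pymilvus 2.3.x; the metric is illustrative
index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
coll.create_index(field_name="vector", index_params=index_params)
coll.load()

# search_list is DiskANN's search-time knob; 100 matches the earlier config
search_params = {"metric_type": "L2", "params": {"search_list": 100}}
coll.search(data=query_vectors, anns_field="vector",
            param=search_params, limit=100, timeout=600)
```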

client pod: fouram-disk-stab-1705352400-1063076031, client error log:

[2024-01-15 21:43:53,572 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)>, <Time:{'RPC start': '2024-01-15 21:43:53.561926', 'RPC error': '2024-01-15 21:43:53.572594'}> (decorators.py:134)
[2024-01-15 21:43:53,574 - ERROR - fouram]: (api_response) : <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)> (api_request.py:54)
[2024-01-15 21:43:53,574 - ERROR - fouram]: [CheckFunc] flush request check failed, response:<MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:116 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush: attempt #0: session not found->datanode-6: node not found[node=6]: unrecoverable error)> (func_check.py:54)
[2024-01-15 21:43:53,574 - ERROR - fouram]: [func_time_catch] :  (api_request.py:121)
[2024-01-15 21:43:53,866 - ERROR - fouram]: RPC error: [flush], <MilvusException: (code=65535, message=attempt #0: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:231 github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/session_manager.go:323 github.com/milvus-io/milvus/internal/datacoord.(*SessionManagerImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/cluster.go:146 github.com/milvus-io/milvus/internal/datacoord.(*ClusterImpl).FlushChannels
/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:120 github.com/milvus-io/milvus/internal/datacoord.(*Server).Flush.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do

server (note: datanode pod `...-kvf57` exited and was replaced by the newer `...-xslfq`, which likely corresponds to the `node not found[node=6]` flush error above):

fouram-disk-sta52400-3-85-5049-etcd-0                             1/1     Running                  0                 5h7m    10.104.31.66    4am-node34   <none>           <none>
fouram-disk-sta52400-3-85-5049-etcd-1                             1/1     Running                  0                 5h7m    10.104.26.177   4am-node32   <none>           <none>
fouram-disk-sta52400-3-85-5049-etcd-2                             1/1     Running                  0                 5h7m    10.104.32.33    4am-node39   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-datacoord-5b57f7b68fr9rjl   1/1     Running                  0                 5h7m    10.104.32.24    4am-node39   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-datanode-57b88768cb-kvf57   0/1     Completed                1 (5h3m ago)      5h7m    10.104.23.88    4am-node27   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-datanode-57b88768cb-xslfq   1/1     Running                  0                 4h26m   10.104.18.196   4am-node25   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-indexcoord-c5cbbc7d8skzzz   1/1     Running                  0                 5h7m    10.104.31.58    4am-node34   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-indexnode-5b5956c49fm5jth   1/1     Running                  0                 5h7m    10.104.19.81    4am-node28   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-proxy-678db75584-jfmvn      1/1     Running                  1 (5h3m ago)      5h7m    10.104.32.25    4am-node39   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-querycoord-bb545d46fzg6vd   1/1     Running                  1 (5h3m ago)      5h7m    10.104.32.23    4am-node39   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-querynode-6787887c975v5bk   1/1     Running                  0                 5h7m    10.104.16.201   4am-node21   <none>           <none>
fouram-disk-sta52400-3-85-5049-milvus-rootcoord-55cf8d85fc2tmhz   1/1     Running                  1 (5h3m ago)      5h7m    10.104.16.199   4am-node21   <none>           <none>
fouram-disk-sta52400-3-85-5049-minio-0                            1/1     Running                  0                 5h7m    10.104.31.69    4am-node34   <none>           <none>
fouram-disk-sta52400-3-85-5049-minio-1                            1/1     Running                  0                 5h7m    10.104.24.229   4am-node29   <none>           <none>
fouram-disk-sta52400-3-85-5049-minio-2                            1/1     Running                  0                 5h7m    10.104.20.25    4am-node22   <none>           <none>
fouram-disk-sta52400-3-85-5049-minio-3                            1/1     Running                  0                 5h7m    10.104.25.165   4am-node30   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-0                    1/1     Running                  0                 5h7m    10.104.26.176   4am-node32   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-1                    1/1     Running                  0                 5h7m    10.104.32.34    4am-node39   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-2                    1/1     Running                  0                 5h7m    10.104.24.232   4am-node29   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-bookie-init-qfmnk           0/1     Completed                0                 5h7m    10.104.5.212    4am-node12   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-broker-0                    1/1     Running                  0                 5h7m    10.104.5.213    4am-node12   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-proxy-0                     1/1     Running                  0                 5h7m    10.104.9.29     4am-node14   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-pulsar-init-cdhjw           0/1     Completed                0                 5h7m    10.104.1.9      4am-node10   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-recovery-0                  1/1     Running                  0                 5h7m    10.104.12.5     4am-node17   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-0                 1/1     Running                  0                 5h7m    10.104.24.228   4am-node29   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-1                 1/1     Running                  0                 5h6m    10.104.19.91    4am-node28   <none>           <none>
fouram-disk-sta52400-3-85-5049-pulsar-zookeeper-2                 1/1     Running                  0                 5h5m    10.104.29.159   4am-node35   <none>           <none> (cli_client.py:132)

elstic avatar Jan 16 '24 03:01 elstic

/assign @sunby

xiaofan-luan avatar Jan 16 '24 14:01 xiaofan-luan

There seems to be a bug when concurrent flushes happen.

xiaofan-luan avatar Feb 04 '24 05:02 xiaofan-luan

/assign @xiaocai2333

xiaofan-luan avatar Feb 04 '24 05:02 xiaofan-luan

/assign @elstic This issue may be fixed. Please verify it with the newest image.

xiaocai2333 avatar May 06 '24 02:05 xiaocai2333

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 15 '24 16:06 stale[bot]