
[Bug]: Standalone pod restarted several times during concurrent insert and search on multiple collections

Open ThreadDao opened this issue 1 year ago • 8 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.0-20230504-842e5d21
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):  rocksmq  
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.2.8.dev1
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy standalone with image 2.2.0-20230504-842e5d21 and the following config:

         config:
           log:
             level: debug
           quotaAndLimits:
             limitWriting:
               diskProtection:
                 diskQuotaPerCollection: 500

  2. Run a concurrent create-collection, insert, and search test (a minimal pymilvus sketch of these steps follows below):
     • create collection
     • insert 1m 128-dim vectors with batch size (ni) 10000
     • flush and get num entities
     • build HNSW index: {"M": 8, "efConstruction": 200}
     • load collection
     • search with nq=10, top_k=100, 5000 times
  3. The standalone pod restarted 4 times.
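
For reference, a minimal pymilvus sketch of the per-collection workload above (connection endpoint, collection name, and field names are placeholders, not taken from the fouram case):

    import numpy as np
    from pymilvus import (
        connections, Collection, CollectionSchema, FieldSchema, DataType,
    )

    # placeholder endpoint; the fouram test runs against the deployed standalone pod
    connections.connect(host="localhost", port="19530")

    dim = 128
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=dim),
    ]
    collection = Collection("quota_repro", CollectionSchema(fields, auto_id=True), shards_num=2)

    # insert 1m vectors in batches of ni=10000, then flush and read num_entities
    ni, total = 10_000, 1_000_000
    for _ in range(total // ni):
        collection.insert([np.random.random((ni, dim)).tolist()])
    collection.flush()
    print(collection.num_entities)

    # build the HNSW index, load, then search nq=10 / top_k=100 for 5000 iterations
    collection.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                                    "params": {"M": 8, "efConstruction": 200}})
    collection.load()
    for _ in range(5000):
        collection.search(np.random.random((10, dim)).tolist(), "vec",
                          {"metric_type": "L2", "params": {"ef": 34}}, limit=100)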

panic logs: (screenshot attached in the original issue)

Expected Behavior

No response

Steps To Reproduce

fouram argo name: `quota-collections-3`

run fouramf case:

    @pytest.mark.locust
    @pytest.mark.parametrize("deploy_mode", [STANDALONE])
    def test_concurrent_locust_multi_collections(self, input_params: InputParamsBase, deploy_mode):
        """
        Used to check whether the memory usage of queryNodes is balanced.

        :test steps:
            1. concurrent test and calculation of RT and QPS
        """
        concurrent_tasks = [
            ConcurrentParams.params_scene_search_test(
                weight=5, shards_num=2, data_size='1m', nb=10000, replica_number=1,
                index_type=pn.IndexTypeName.HNSW, index_param={"M": 8, "efConstruction": 200}, nq=10, top_k=100, search_param={"ef": 34},
                search_counts=5000)
        ]
        default_case_params = ConcurrentParams().params_scene_concurrent(
            concurrent_tasks, concurrent_number=[50], during_time="5h", interval=20, dataset_size=0, ni_per=0,
            replica_number=1, **cdp.DefaultIndexParams.HNSW)

        self.concurrency_template(input_params=input_params, cpu=dp.min_cpu, mem=dp.min_mem,
                                  deploy_mode=deploy_mode, old_version_format=False,
                                  case_callable_obj=ConcurrentClientBase().scene_concurrent_locust,
                                  default_case_params=default_case_params)


### Milvus Log

server pods in `fouram` cluster and `qa-milvus` ns:

    k get pod -o wide -n qa-milvus | grep fouram-op-54-8249
    fouram-op-54-8249-etcd-0                               1/1   Running   0              28h   10.104.4.130   4am-node11
    fouram-op-54-8249-milvus-standalone-6b454c485b-bgfnx   1/1   Running   4 (101m ago)   28h   10.104.4.151   4am-node11
    fouram-op-54-8249-minio-744659cbdf-h5xlr               1/1   Running   0              28h   10.104.4.131   4am-node11


client pod in `fouram` cluster and `qa` ns:

quota-collections-3-1904905947


[standalone_pre.log](https://github.com/milvus-io/milvus/files/11406741/standalone_pre.log)


### Anything else?

_No response_

ThreadDao avatar May 05 '23 13:05 ThreadDao

/assign @jiaoew1991 /unassign

yanliang567 avatar May 06 '23 01:05 yanliang567

/assign @yah01 /unassign

jiaoew1991 avatar May 06 '23 09:05 jiaoew1991

The panic is a concurrent write/read on a map (screenshot attached).

yah01 avatar May 09 '23 02:05 yah01

#23957 has fixed this

yah01 avatar May 09 '23 02:05 yah01

/assign @ThreadDao please help verify with #23957

yah01 avatar May 09 '23 02:05 yah01

Rerun with image 2.2.0-20230509-341b62d5: the standalone pod still restarted; one restart was OOMKilled and the other completed with exit code 0 (screenshot attached).

I stopped the test and increased the standalone pod memory from 16G to 20G, but it still crashed:

fouram-op-54-8249-etcd-0                                          1/1     Running                  0                5d      10.104.4.130    4am-node11   <none>           <none>
fouram-op-54-8249-milvus-standalone-865768cb7c-vhvsm              0/1     Running                  10 (5m32s ago)   40m     10.104.4.170    4am-node11   <none>           <none>
fouram-op-54-8249-minio-744659cbdf-h5xlr                          1/1     Running                  0                5d      10.104.4.131    4am-node11   <none>           <none>

standalone pod previous log: standalone_pre_1.log

ThreadDao avatar May 09 '23 09:05 ThreadDao

/assign @yah01 please help check whether this is caused by insufficient memory. If so, why is the exit code 0? (A sketch for checking the container's last termination state follows below.)
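
One way to distinguish an OOM kill from a clean exit is to read the container's last termination state; a minimal sketch using the Kubernetes Python client (the pod name below is a placeholder for the actual standalone pod):

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # placeholder pod name; substitute the current standalone pod in qa-milvus
    pod = v1.read_namespaced_pod("fouram-op-54-8249-milvus-standalone-xxx", "qa-milvus")
    for cs in pod.status.container_statuses:
        last = cs.last_state.terminated
        if last is not None:
            # an OOM kill reports reason "OOMKilled" with exit code 137;
            # a clean shutdown reports reason "Completed" with exit code 0
            print(cs.name, last.reason, last.exit_code, last.finished_at)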

ThreadDao avatar May 09 '23 09:05 ThreadDao

/assign @yah01 /unassign

Image 2.2.0-20230512-d882624b: the standalone pod also crashed:

fouram-op-54-8249-etcd-0                                          1/1     Running            0                 8d
fouram-op-54-8249-milvus-standalone-576d456c9d-7z9sb              0/1     Running            2 (45s ago)       6m56s
fouram-op-54-8249-minio-744659cbdf-h5xlr                          1/1     Running            0                 8d

standalone_pre_1.log

ThreadDao avatar May 12 '23 10:05 ThreadDao

Standalone also crashed with exit code 0 (screenshot attached).

pre log: Uploading standalone_pre_completed.log…

ThreadDao avatar May 17 '23 08:05 ThreadDao

@yah01 any updates?

binbinlv avatar May 24 '23 07:05 binbinlv

@yah01 any updates?

Related: #24489

yah01 avatar May 29 '23 10:05 yah01

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Aug 03 '23 03:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 04 '23 06:09 stale[bot]