
[Bug]: Querynode panic: assignment to entry in nil map when inserting, deleting, searching concurrently

Open · ThreadDao opened this issue · 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230410-bbfa3967
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev4
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search failed after a while of concurrent operations (insert, delete, and search).

[2023-04-10 12:08:58,211 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-10 12:08:29.323471', 'RPC error': '2023-04-10 12:08:58.211594'}> (decorators.py:108)
[2023-04-10 12:08:58,213 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)>
 (api_request.py:48)
[2023-04-10 12:08:58,213 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)> (api_request.py:49)

I found that the querynode has some panic logs (see the attached screenshot).

Expected Behavior

No response

Steps To Reproduce

1. Create collection and create index
2. Insert 1m data and flush
3. Load collection
4. Run scene_insert_delete_flush and search concurrently. After a while, search failed

Case:

    def test_scale_out_querynode(self, input_params: InputParamsBase):
        """
        :test steps:
            1. deploy and insert, load, index, search
            2. upgrade server: scale-out querynode
            3. search test
        """
        # input_params.upgrade_config = {"spec": {"components": {"querynode": {"replicas": 2}}}}
        input_params.upgrade_config = {"queryNode": {"replicas": 2}}

        # before scale: deploy and concurrent search
        input_params.case_skip_clean_collection = True
        default_case_params = ConcurrentParams().params_scene_concurrent(
            [
                ConcurrentParams.params_search(weight=7, nq=1, top_k=10, search_param={"nprobe": 16}, timeout=300),
                ConcurrentParams.params_scene_insert_delete_flush(
                    weight=1, insert_length=10, delete_length=1, random_id=True, random_vector=True, varchar_filled=True)
            ],
            concurrent_number=[20],
            during_time="1h", interval=20,
            **cdp.DefaultIndexParams.IVF_SQ8)
        self.concurrency_template(input_params=input_params, cpu=2, mem=8,
                                  deploy_mode=CLUSTER, old_version_format=False, sync_report=True,
                                  case_callable_obj=ConcurrentClientBase().scene_concurrent_locust,
                                  default_case_params=default_case_params)

        # scale-out query nodes and concurrent test
        input_params.case_skip_prepare = True
        input_params.case_skip_prepare_clean = True

        self.scale_serial_concurrent_template(input_params, deploy_mode=CLUSTER,
                                              case_callable_after_scale=ConcurrentClientBase().scene_concurrent_locust,
                                              default_case_params=default_case_params)


### Milvus Log

argo workflow name: fouramf-vpqks
`devops` cluster, `chaos-testing` namespace:

NAME READY STATUS RESTARTS AGE IP NODE
fouramf-8m9j6-57-5498-etcd-0 1/1 Running 0 61m 10.102.7.208 devops-node11
fouramf-8m9j6-57-5498-etcd-1 1/1 Running 0 63m 10.102.6.6 devops-node10
fouramf-8m9j6-57-5498-etcd-2 1/1 Running 0 64m 10.102.10.168 devops-node20
fouramf-8m9j6-57-5498-milvus-datacoord-589dcbb888-9j8v8 1/1 Running 1 (130m ago) 134m 10.102.7.78 devops-node11
fouramf-8m9j6-57-5498-milvus-datanode-757d7bb5f6-f99j2 1/1 Running 1 (130m ago) 134m 10.102.7.85 devops-node11
fouramf-8m9j6-57-5498-milvus-indexcoord-6c9b8ff668-lgvqw 1/1 Running 0 134m 10.102.7.73 devops-node11
fouramf-8m9j6-57-5498-milvus-indexnode-7cd9bfd447-gjkg6 1/1 Running 0 134m 10.102.7.52 devops-node11
fouramf-8m9j6-57-5498-milvus-proxy-84cbcd8cc8-fxsjj 1/1 Running 1 (130m ago) 134m 10.102.7.79 devops-node11
fouramf-8m9j6-57-5498-milvus-querycoord-7b969d7c45-wbjmm 1/1 Running 1 (130m ago) 134m 10.102.7.75 devops-node11
fouramf-8m9j6-57-5498-milvus-querynode-6ddfc9dbf-m6pqt 1/1 Running 8 (11m ago) 64m 10.102.6.5 devops-node10
fouramf-8m9j6-57-5498-milvus-querynode-6ddfc9dbf-z2xns 1/1 Running 18 134m 10.102.7.47 devops-node11
fouramf-8m9j6-57-5498-milvus-rootcoord-6bf846bff7-5tthr 1/1 Running 1 (130m ago) 134m 10.102.7.87 devops-node11
fouramf-8m9j6-57-5498-minio-0 1/1 Running 0 134m 10.102.7.140 devops-node11
fouramf-8m9j6-57-5498-minio-1 1/1 Running 0 134m 10.102.6.242 devops-node10
fouramf-8m9j6-57-5498-minio-2 1/1 Running 0 134m 10.102.5.217 devops-node21
fouramf-8m9j6-57-5498-minio-3 1/1 Running 0 134m 10.102.10.165 devops-node20
fouramf-8m9j6-57-5498-pulsar-bookie-0 1/1 Running 0 134m 10.102.7.159 devops-node11
fouramf-8m9j6-57-5498-pulsar-bookie-1 1/1 Running 0 134m 10.102.6.247 devops-node10
fouramf-8m9j6-57-5498-pulsar-bookie-2 1/1 Running 0 134m 10.102.5.220 devops-node21
fouramf-8m9j6-57-5498-pulsar-broker-0 1/1 Running 0 134m 10.102.7.91 devops-node11
fouramf-8m9j6-57-5498-pulsar-proxy-0 1/1 Running 0 134m 10.102.7.94 devops-node11
fouramf-8m9j6-57-5498-pulsar-recovery-0 1/1 Running 0 134m 10.102.7.83 devops-node11
fouramf-8m9j6-57-5498-pulsar-zookeeper-0 1/1 Running 0 134m 10.102.7.116 devops-node11
fouramf-8m9j6-57-5498-pulsar-zookeeper-1 1/1 Running 0 133m 10.102.6.254 devops-node10
fouramf-8m9j6-57-5498-pulsar-zookeeper-2 1/1 Running 0 132m 10.102.9.91 devops-node13


### Anything else?

_No response_

ThreadDao · Apr 11 '23 02:04

/unassign @yanliang567
/assign

congqixia · Apr 11 '23 02:04

There were two panic errors in this scenario. Will fix the nil map one first. The other one was a SegV panic; will look into it.
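
For context on the first panic: `assignment to entry in nil map` is the Go runtime error raised when code writes to a map that was declared but never initialized with `make`. Under concurrent insert/delete/search load it typically surfaces in a struct whose map field is only initialized on some code paths. Below is a minimal, hypothetical Go sketch of the bug shape and the usual fix; the type and field names are illustrative, not the actual querynode code:

    package main

    import "sync"

    // buggyIndex mimics the bug shape: the segments map is never
    // initialized, so the first write panics with
    // "assignment to entry in nil map".
    type buggyIndex struct {
        segments map[int64]string // segmentID -> channel name
    }

    func (b *buggyIndex) add(id int64, ch string) {
        b.segments[id] = ch // panics on a nil map
    }

    // fixedIndex initializes the map up front and guards it with a
    // mutex, since unsynchronized concurrent map writes are themselves
    // a fatal error in Go ("concurrent map writes").
    type fixedIndex struct {
        mu       sync.RWMutex
        segments map[int64]string
    }

    func newFixedIndex() *fixedIndex {
        return &fixedIndex{segments: make(map[int64]string)}
    }

    func (f *fixedIndex) add(id int64, ch string) {
        f.mu.Lock()
        defer f.mu.Unlock()
        f.segments[id] = ch
    }

    func (f *fixedIndex) get(id int64) (string, bool) {
        f.mu.RLock()
        defer f.mu.RUnlock()
        ch, ok := f.segments[id]
        return ch, ok
    }

    func main() {
        idx := newFixedIndex()
        var wg sync.WaitGroup
        for i := int64(0); i < 100; i++ {
            wg.Add(1)
            go func(i int64) {
                defer wg.Done()
                idx.add(i, "by-dev-rootcoord-dml_0")
            }(i)
        }
        wg.Wait()
    }

Whether the merged patch takes exactly this shape is not shown in this thread; the sketch only illustrates the class of panic named in the title.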

congqixia · Apr 11 '23 02:04

The patch has been merged. Could you please verify whether this problem still exists?
/unassign
/assign @ThreadDao

congqixia · Apr 11 '23 10:04

@congqixia The panic problem is fixed, verified with image master-20230412-296380d6, but the verification hit another issue: #23444.

ThreadDao · Apr 14 '23 12:04

@ThreadDao can we close this issue and move on to the next one?

congqixia · Apr 15 '23 00:04