milvus
[Bug]: Querynode panic: assignment to entry in nil map when inserting, deleting, searching concurrently
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20230410-bbfa3967
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev4
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Search failed after a while of concurrent operations (including insert, delete, and search):
```
[2023-04-10 12:08:58,211 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-10 12:08:29.323471', 'RPC error': '2023-04-10 12:08:58.211594'}> (decorators.py:108)
```
```
[2023-04-10 12:08:58,213 - ERROR - fouram]: Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
    return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
    raise pre_err
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)>
(api_request.py:48)
[2023-04-10 12:08:58,213 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046115): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_0_440697645451444802v0 is not available in any replica, err=LackSegment(segmentID=440697645453046100): context done during sleep after run#6: context deadline exceeded)> (api_request.py:49)
```
I found that the querynode has some panic logs:
Expected Behavior
No response
Steps To Reproduce
1. Create collection and create index
2. Insert 1m data and flush
3. Load collection
4. Run scene_insert_delete_flush and search concurrently; after a while, search failed
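Step 4 runs inserts, deletes, and searches against the same collection at the same time. As an illustration only (this is not Milvus code; the `collection` type and its methods here are hypothetical), the access pattern looks roughly like this, with an `RWMutex` guarding the shared map so the mixed readers and writers do not race:

```go
package main

import (
	"fmt"
	"sync"
)

// collection is a toy stand-in for a querynode shard: a shared map of
// segment data that is read and written concurrently.
type collection struct {
	mu       sync.RWMutex
	segments map[int64][]float32 // hypothetical: segment ID -> vector data
}

func newCollection() *collection {
	// Initializing the map up front avoids the nil-map write panic entirely.
	return &collection{segments: make(map[int64][]float32)}
}

func (c *collection) insert(id int64, vec []float32) {
	c.mu.Lock() // exclusive lock for writers
	defer c.mu.Unlock()
	c.segments[id] = vec
}

func (c *collection) delete(id int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.segments, id)
}

func (c *collection) search() int {
	c.mu.RLock() // shared lock: many searches may read at once
	defer c.mu.RUnlock()
	return len(c.segments)
}

func main() {
	c := newCollection()
	var wg sync.WaitGroup
	// Interleave inserts, deletes, and searches, as the reproduction does.
	for i := int64(0); i < 100; i++ {
		wg.Add(3)
		go func(i int64) { defer wg.Done(); c.insert(i, []float32{1, 2}) }(i)
		go func(i int64) { defer wg.Done(); c.delete(i) }(i)
		go func() { defer wg.Done(); _ = c.search() }()
	}
	wg.Wait()
	fmt.Println("done, segments remaining:", c.search())
}
```

Without the mutex (or with a map left nil), running this under `go run -race` would flag the data race that workloads like the one above tend to expose.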
Case:

```python
def test_scale_out_querynode(self, input_params: InputParamsBase):
    """
    :test steps:
    1. deploy and insert, load, index, search
    2. upgrade server: scale-out querynode
    3. search test
    """
    # input_params.upgrade_config = {"spec": {"components": {"querynode": {"replicas": 2}}}}
    input_params.upgrade_config = {"queryNode": {"replicas": 2}}
    # before scale: deploy and concurrent search
    input_params.case_skip_clean_collection = True
    default_case_params = ConcurrentParams().params_scene_concurrent(
        [
            ConcurrentParams.params_search(weight=7, nq=1, top_k=10, search_param={"nprobe": 16}, timeout=300),
            ConcurrentParams.params_scene_insert_delete_flush(
                weight=1, insert_length=10, delete_length=1, random_id=True, random_vector=True, varchar_filled=True)
        ],
        concurrent_number=[20],
        during_time="1h", interval=20,
        **cdp.DefaultIndexParams.IVF_SQ8)
    self.concurrency_template(input_params=input_params, cpu=2, mem=8,
                              deploy_mode=CLUSTER, old_version_format=False, sync_report=True,
                              case_callable_obj=ConcurrentClientBase().scene_concurrent_locust,
                              default_case_params=default_case_params)
    # scale-out query nodes and concurrent test
    input_params.case_skip_prepare = True
    input_params.case_skip_prepare_clean = True
    self.scale_serial_concurrent_template(input_params, deploy_mode=CLUSTER,
                                          case_callable_after_scale=ConcurrentClientBase().scene_concurrent_locust,
                                          default_case_params=default_case_params)
```
### Milvus Log
argo workflow name: fouramf-vpqks
`devops` cluster and `chaos-testing` ns:

```
fouramf-8m9j6-57-5498-etcd-0   1/1   Running   0   61m   10.102.7.208   devops-node11
```
### Anything else?
_No response_
/unassign @yanliang567 /assign
There were two panic errors in this scenario. Will fix the nil map one first. The other one was a SegV panic; will take a look into it.
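For readers unfamiliar with the first error: "assignment to entry in nil map" is the Go runtime panic raised when code writes to a map whose zero value was never initialized with `make`, which is easy to hit when a struct containing a map is shared between goroutines. A minimal sketch, not Milvus code (the `shard` type and its methods are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// shard holds a segment map; the zero value of a Go map is nil.
type shard struct {
	mu       sync.Mutex
	segments map[int64]int64 // hypothetical: segment ID -> row count
}

// insertUnsafe writes without checking initialization: if segments is
// still nil, this panics with "assignment to entry in nil map".
func (s *shard) insertUnsafe(id, rows int64) {
	s.segments[id] = rows
}

// insertSafe initializes the map lazily under a lock, the usual fix
// when several goroutines may write concurrently.
func (s *shard) insertSafe(id, rows int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.segments == nil {
		s.segments = make(map[int64]int64)
	}
	s.segments[id] = rows
}

func main() {
	s := &shard{}
	func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r) // the nil-map runtime error
			}
		}()
		s.insertUnsafe(1, 100) // panics: map was never initialized
	}()
	s.insertSafe(1, 100) // succeeds: map is created before the write
	fmt.Println("rows:", s.segments[1])
}
```

Reading from a nil map is safe and returns zero values; only writes panic, which is why this kind of bug often surfaces only once a concurrent write path (here, insert/delete) is exercised.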
Patch has been merged. Could you please verify whether this problem still exists? /unassign /assign @ThreadDao
@congqixia
Panic problem fixed, verified on image master-20230412-296380d6.
But verification hit another issue: #23444.
@ThreadDao can we close this issue and move on to next issue?