[Bug]: Failed to search: node offline[node=-1]: channel not available when `streamingDeltaForwardPolicy` is `Direct`
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Environment
- Milvus version: 2.4-20241010-eaa94875-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
### Current Behavior

Deploy a Milvus cluster with the following config; the settings relevant to this issue are `queryNode.levelZeroForwardPolicy: RemoteLoad` and `queryNode.streamingDeltaForwardPolicy: Direct`:
```yaml
config:
  dataCoord:
    enableActiveStandby: true
    segment:
      expansionRate: 1.15
      maxSize: 2048
      sealProportion: 0.12
  dataNode:
    compaction:
      levelZeroBatchMemoryRatio: 0.5
  indexCoord:
    enableActiveStandby: true
  log:
    level: debug
  minio:
    accessKeyID: miniozong
    bucketName: bucket-zong
    rootPath: compact_2
    secretAccessKey: miniozong
  queryCoord:
    enableActiveStandby: true
  queryNode:  # <-- the settings relevant to this issue
    levelZeroForwardPolicy: RemoteLoad
    streamingDeltaForwardPolicy: Direct
  quotaAndLimits:
    dml:
      deleteRate:
        max: 0.5
      enabled: false
      insertRate:
        max: 8
      upsertRate:
        max: 8
    growingSegmentsSizeProtection:
      enabled: false
      highWaterLevel: 0.2
      lowWaterLevel: 0.1
    limitWriting:
      memProtection:
        dataNodeMemoryHighWaterLevel: 0.85
        dataNodeMemoryLowWaterLevel: 0.75
        queryNodeMemoryHighWaterLevel: 0.85
        queryNodeMemoryLowWaterLevel: 0.75
    limits:
      complexDeleteLimitEnable: true
  rootCoord:
    enableActiveStandby: true
  trace:
    exporter: jaeger
    jaeger:
      url: http://tempo-distributor.tempo:14268/api/traces
    sampleFraction: 1
```
Test steps:

- There is a collection with an int64 PK field and a vector field; the collection holds 100M entities.
- When deletes start, searches begin to fail (a sketch of a search loop that hits this error follows the log below):
```
[2024-10-15 10:48:02,882 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=fail to search on QueryNode 23: distribution is not servcieable: channel not available[channel=compact-opt-100m-2-rootcoord-dml_0_453128445902192997v0])>, <Time:{'RPC start': '2024-10-15 10:47:45.846202', 'RPC error': '2024-10-15 10:48:02.882154'}> (decorators.py:147)
```
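For context, a minimal sketch of the search side that runs while the deletes below are issued and surfaces the error above; the host, collection name, vector field name, dimension, and search parameters are assumptions, since the report does not include the search script:

```python
import logging
import time

from pymilvus import Collection, MilvusException, connections

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fouram")


def search_loop(host, name, anns_field="emb", dim=128, duration=600):
    # Search roughly once per second; once the delete workload starts,
    # some searches fail with code=503 "channel not available" as above.
    connections.connect(host=host)
    c = Collection(name=name)
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            c.search(
                data=[[0.0] * dim],  # placeholder query vector
                anns_field=anns_field,
                param={"metric_type": "L2", "params": {"nprobe": 16}},
                limit=10,
            )
        except MilvusException as e:
            log.error(f"RPC error: [search], {e}")
        time.sleep(1)
```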
Client delete log:

```
[2024-10-15 18:46:51,711 - INFO - ci_test]: start to delete [0, ..., 15999] with length 16000 (tmp.py:40)
[2024-10-15 18:46:51,825 - INFO - ci_test]: delete cost 0.11316943168640137 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
[2024-10-15 18:46:52,716 - INFO - ci_test]: start to delete [16000, ..., 31999] with length 16000 (tmp.py:40)
...
[2024-10-15 18:51:55,817 - INFO - ci_test]: delete cost 0.11813139915466309 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
[2024-10-15 18:51:56,703 - INFO - ci_test]: start to delete [4864000, ..., 4879999] with length 16000 (tmp.py:40)
[2024-10-15 18:51:56,825 - INFO - ci_test]: delete cost 0.12181949615478516 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
```
### Expected Behavior

_No response_
### Steps To Reproduce

- https://argo-workflows.zilliz.cc/archived-workflows/qa/64dab658-11fc-4a63-ac02-8770c303363f?nodeId=compact-opt-delete-100m-6b
- Delete script:
```python
import logging
import time

from pymilvus import Collection, connections

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ci_test")


def get_ids(start, end, batch):
    # yield consecutive id batches covering [start, end), then a None sentinel
    while start < end:
        step = min(batch, end - start)
        yield list(range(start, start + step))
        start += step
    yield None


def delete_with_rate(_host, _name, _start, _end, _batch, pk="id"):
    connections.connect(host=_host)
    c = Collection(name=_name)
    for ids in get_ids(_start, _end, _batch):
        if ids is None:
            break
        log.info(f"start to delete [{ids[0]}, ..., {ids[-1]}] with length {len(ids)}")
        start_time = time.time()
        delete_res = c.delete(expr=f"{pk} in {ids}")
        cost = time.time() - start_time
        log.info(f"delete cost {cost} with res {delete_res}")
        # throttle to at most one batch per second
        if cost < 1:
            time.sleep(1 - cost)


if __name__ == '__main__':
    host = "xxx"
    name = "fouram_3QEsE82U"
    delete_with_rate(host, name, 0, 50000000, _batch=16000)
```
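To make the reproduction self-contained, a sketch of a collection matching the description in Current Behavior (an int64 PK field plus a vector field); the vector field name, dimension, and index parameters are assumptions, as the report does not include the collection-creation script:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)


def create_collection(host, name, dim=128):
    # hypothetical schema: the issue only states "an int64 PK field and a
    # vector field" with 100M entities; dim and index params are assumptions
    connections.connect(host=host)
    schema = CollectionSchema(fields=[
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ])
    c = Collection(name=name, schema=schema)
    c.create_index(
        field_name="emb",
        index_params={"index_type": "HNSW", "metric_type": "L2",
                      "params": {"M": 30, "efConstruction": 360}},
    )
    c.load()
    return c
```

The PK field is named `id` here to match the `pk="id"` default used by the delete script above.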
### Milvus Log
Pods:

```
compact-opt-100m-2-milvus-datanode-74b5c7854b-xxcdl 1/1 Running 0 3h53m 10.104.14.7 4am-node18
```
### Anything else?
_No response_