milvus
[Bug]: [streaming] Query count(*) results in more than expected during chaos kill streamingNode container
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: chyezh-enhance_make_recovery_components_full-b41ce80-20250427
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): woodpecker
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
server
- streamingNode: 2 * 4c16g
- queryNode: 2 * 4c16g
- milvus.yaml config
```yaml
config:
  common:
    enabledJSONKeyStats: true
  dataCoord:
    enableActiveStandby: true
  indexCoord:
    enableActiveStandby: true
  log:
    level: debug
  queryCoord:
    enableActiveStandby: true
  queryNode:
    mmap:
      growingMmapEnabled: true
      scalarField: true
      scalarIndex: true
      vectorField: true
      vectorIndex: true
  rootCoord:
    enableActiveStandby: true
  streaming:
    walWriteAheadBuffer:
      capacity: 1m
      keepalive: 0.5s
```
client test
- create collection fouram_jDzJ1VeB with fields: pk + vector + int64_1(partition_key) + json_1
- create vector index HNSW
- insert 10m entities -> flush -> index again -> load again
- concurrent requests: query count(*) + search + upsert + flush + scene_search_test
- upsert: start pk from 0
- scene_search_test: create collection -> index -> load -> insert 10k entities -> flush -> index -> load -> search -> drop collection
- apply chaos to kill a streamingNode container randomly every 2 minutes during 10 minutes
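Why the expected count stays at 10m even under concurrent upserts: the upserts reuse primary keys starting from 0, so every upsert replaces an existing entity rather than adding a new one. A minimal sketch of that invariant in plain Python (no Milvus involved; the dict-based model and names are illustrative, not the fouram implementation):

```python
# Model a collection as a map from primary key to entity. An upsert on an
# existing pk replaces the row in place, so the logical entity count
# (what count(*) should report) never grows.
collection = {pk: {"pk": pk, "value": 0} for pk in range(10)}  # stand-in for the 10m inserted entities

def upsert(batch):
    """Replace-or-insert each row by its primary key."""
    for row in batch:
        collection[row["pk"]] = row

# The concurrent upserts in the test start from pk 0, i.e. they only
# ever touch keys that already exist.
upsert([{"pk": pk, "value": 1} for pk in range(5)])

assert len(collection) == 10  # count(*) must still report the original total
```

Any count above the inserted total therefore means duplicated data is being counted, not new user data.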
result
During chaos, query count(*) returns success, but the actual count exceeds the expected 10m.
Expected Behavior
count(*) should always return 10m
Steps To Reproduce
https://argo-workflows.zilliz.cc/archived-workflows/qa/60f7c994-12e4-49a2-a476-892c155200b3?nodeId=zong-chaos-clu-wp-sn-3-4263864101
Milvus Log
pods:
```
zong-chaos-clu-wp-sn-3-etcd-0                                  1/1  Running  0           13h  10.104.18.57   4am-node25  <none>  <none>
zong-chaos-clu-wp-sn-3-etcd-1                                  1/1  Running  0           13h  10.104.34.60   4am-node37  <none>  <none>
zong-chaos-clu-wp-sn-3-etcd-2                                  1/1  Running  0           13h  10.104.27.117  4am-node31  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-9nmzm        1/1  Running  0           13h  10.104.34.61   4am-node37  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-jl529        1/1  Running  0           13h  10.104.6.98    4am-node13  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-4ssbv        1/1  Running  0           13h  10.104.18.63   4am-node25  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-5kxtc        1/1  Running  0           13h  10.104.20.155  4am-node22  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-proxy-59dcc469fd-kf5kk           1/1  Running  0           13h  10.104.14.200  4am-node18  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-488xk     1/1  Running  0           13h  10.104.18.64   4am-node25  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-f4j45     1/1  Running  0           13h  10.104.19.97   4am-node28  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-thhvq   1/1  Running  4 (12h ago) 13h  10.104.16.119  4am-node21  <none>  <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-zg6kt   1/1  Running  4 (12h ago) 13h  10.104.17.64   4am-node23  <none>  <none>
zong-chaos-clu-wp-sn-3-minio-0                                 1/1  Running  0           13h  10.104.18.58   4am-node25  <none>  <none>
zong-chaos-clu-wp-sn-3-minio-1                                 1/1  Running  0           13h  10.104.30.66   4am-node38  <none>  <none>
zong-chaos-clu-wp-sn-3-minio-2                                 1/1  Running  0           13h  10.104.27.118  4am-node31  <none>  <none>
zong-chaos-clu-wp-sn-3-minio-3                                 1/1  Running  0           13h  10.104.24.100  4am-node29  <none>  <none>
```
Anything else?
No response
May be related to the log loss issue: #41563
Reproduced on master-20250519-38ded736-amd64.
But it has nothing to do with the chaos: the count was already wrong before the chaos started.
- argo: ong-chaos-standalone-1
/assign @ThreadDao
/assign @ThreadDao This should be fixed; please help verify it. /unassign
@chyezh
- image: master-20250616-5e184417-amd64
- argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/eb560b14-8977-45dd-96f2-c17cfdd18939?nodeId=zong-chaos-standalone-26-4
/assign
May inflate the count result by a huge amount. Already fixed by #42689.
/assign @ThreadDao /unassign
I tested 6 times and didn't reproduce it.
Fixed on master-20250620-b043ff14-amd64
@weiliu1031 https://argo-workflows.zilliz.cc/archived-workflows/qa/e277daf0-b244-4d6e-a98f-c9af9f9e79f2?nodeId=zong-chaos-pod-sn-1751824800-966436947
@weiliu1031 master-20250709-7f8c5c9b-amd64
- https://argo-workflows.zilliz.cc/archived-workflows/qa/7ec1629f-7be2-4c4d-83b3-e08c96560f0a?nodeId=zong-chaos-standalone-1752087600
- https://argo-workflows.zilliz.cc/archived-workflows/qa/01594036-4cbc-403b-a680-63d52882e9c4?nodeId=zong-chaos-pod-dn-1752084000
@weiliu1031
- image: master-20250715-fe8de016-amd64
- argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/b5cba396-50e9-4386-a4ee-527ce69900d6?nodeId=zong-chaos-standalone-1752606000
```
Traceback (most recent call last):
  File "/src/fouram/client/concurrent/locust_client.py", line 28, in wrapper
    result = func(*args, **kwargs)
  File "/src/fouram/client/cases/base.py", line 890, in concurrent_query
    return self.collection_wrap.query(expr=params.query_expr, **params.obj_params)
  File "/src/fouram/client/client_base/collection_wrapper.py", line 164, in query
    check_result = ResponseChecker(res, func_name, check_task, check_items, res_result, expression=expr,
  File "/src/fouram/client/check/func_check.py", line 90, in run
    result = self.check_query_output_count(self.response, self.succ, self.check_items)
  File "/src/fouram/client/check/func_check.py", line 338, in check_query_output_count
    assert int(query_count) == expected_query_count, f'{query_count} == {expected_query_count}'
AssertionError: 9997600 == 10000000
```
- pods:
```
zong-chaos-standalone-1752606000-milvus-standalone-75bcc889k2ln   1/1  Running  0  6h43m  10.104.18.17  4am-node25  <none>  <none>
```
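The failing check in the traceback above can be reproduced in isolation. A minimal sketch of a count verifier along the lines of fouram's `check_query_output_count` (the `[{"count(*)": n}]` response shape matches what pymilvus returns for a `count(*)` query; this standalone helper is illustrative, not the actual fouram code):

```python
def check_query_output_count(response, expected_count):
    """Compare the count(*) value in a query response against the
    expected entity count, raising AssertionError on mismatch."""
    # pymilvus returns a count(*) query result as [{"count(*)": <int>}]
    query_count = response[0]["count(*)"]
    assert int(query_count) == expected_count, f"{query_count} != {expected_count}"

# A response like the one in the failure above trips the assertion:
try:
    check_query_output_count([{"count(*)": 9997600}], 10_000_000)
except AssertionError as e:
    print(f"mismatch detected: {e}")

# A matching count passes silently:
check_query_output_count([{"count(*)": 10_000_000}], 10_000_000)
```

Note that the assertion message in the real checker uses `==` as the separator, which is why the reported error reads `9997600 == 10000000` even though the values are unequal.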
/assign @chyezh
/assign @zhagnlu
/assign @ThreadDao
not reproduced