
[Bug]: [streaming] Query count(*) results in more than expected during chaos kill streamingNode container

Open ThreadDao opened this issue 7 months ago • 9 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: chyezh-enhance_make_recovery_components_full-b41ce80-20250427
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): woodpecker   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

  • streamingNode: 2 * 4c16g
  • queryNode: 2 * 4c16g
  • milvus.yaml config
  config:
    common:
      enabledJSONKeyStats: true
    dataCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    queryCoord:
      enableActiveStandby: true
    queryNode:
      mmap:
        growingMmapEnabled: true
        scalarField: true
        scalarIndex: true
        vectorField: true
        vectorIndex: true
    rootCoord:
      enableActiveStandby: true
    streaming:
      walWriteAheadBuffer:
        capacity: 1m
        keepalive: 0.5s

client test

  1. create collection fouram_jDzJ1VeB with fields: pk + vector + int64_1(partition_key) + json_1
  2. create vector index HNSW
  3. insert 10m entities -> flush -> index again -> load again
  4. concurrent requests: query count(*) + search + upsert + flush + scene_search_test
  • upsert: start pk from 0
  • scene_search_test: create collection -> index -> load -> insert 10k -> flush -> index -> load -> search -> drop collection
  5. apply chaos to kill a streamingNode container randomly every 2 minutes during a 10-minute window
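For context on why the expected count stays fixed at 10m: the upsert workload reuses pks starting from 0, so every upsert replaces an existing entity instead of adding a new one. A toy model of that invariant (pure Python, with a scaled-down entity count as an assumption; this is not the fouram client code):

```python
# Toy model: upserts that reuse existing primary keys must not change the
# visible row count. Sizes are scaled down from the real test (10m -> 10_000).
NUM_ENTITIES = 10_000  # stands in for the 10m entities in the real test

# Initial bulk insert: pk -> payload
table = {pk: {"int64_1": pk % 64} for pk in range(NUM_ENTITIES)}

def upsert(table, rows):
    """Upsert semantics: replace the row if the pk exists, insert otherwise."""
    for pk, payload in rows:
        table[pk] = payload

# The concurrent upserts start their pk range from 0, so every pk collides
# with an existing row and the total count must stay constant.
upsert(table, [(pk, {"int64_1": -1}) for pk in range(5_000)])

assert len(table) == NUM_ENTITIES  # count(*) should still report 10_000
```

Any count(*) result above 10m therefore implies duplicated entities (e.g. replayed WAL entries being applied twice), not legitimate growth.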

result

During chaos, query count(*) returns successfully, but the actual count is greater than the expected 10m (screenshot attached).

Expected Behavior

count(*) should always return 10m

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/60f7c994-12e4-49a2-a476-892c155200b3?nodeId=zong-chaos-clu-wp-sn-3-4263864101

Milvus Log

pods:

zong-chaos-clu-wp-sn-3-etcd-0                                     1/1     Running     0               13h     10.104.18.57    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-etcd-1                                     1/1     Running     0               13h     10.104.34.60    4am-node37   <none>           <none>
zong-chaos-clu-wp-sn-3-etcd-2                                     1/1     Running     0               13h     10.104.27.117   4am-node31   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-9nmzm           1/1     Running     0               13h     10.104.34.61    4am-node37   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-jl529           1/1     Running     0               13h     10.104.6.98     4am-node13   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-4ssbv           1/1     Running     0               13h     10.104.18.63    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-5kxtc           1/1     Running     0               13h     10.104.20.155   4am-node22   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-proxy-59dcc469fd-kf5kk              1/1     Running     0               13h     10.104.14.200   4am-node18   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-488xk        1/1     Running     0               13h     10.104.18.64    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-f4j45        1/1     Running     0               13h     10.104.19.97    4am-node28   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-thhvq      1/1     Running     4 (12h ago)     13h     10.104.16.119   4am-node21   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-zg6kt      1/1     Running     4 (12h ago)     13h     10.104.17.64    4am-node23   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-0                                    1/1     Running     0               13h     10.104.18.58    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-1                                    1/1     Running     0               13h     10.104.30.66    4am-node38   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-2                                    1/1     Running     0               13h     10.104.27.118   4am-node31   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-3                                    1/1     Running     0               13h     10.104.24.100   4am-node29   <none>           <none>

Anything else?

No response

ThreadDao avatar Apr 28 '25 03:04 ThreadDao

May be related to the log loss issue: #41563

chyezh avatar Apr 28 '25 05:04 chyezh

Reproduced; see the argo monitor.

chyezh avatar May 07 '25 06:05 chyezh

count(*) is wrong before applying chaos

ThreadDao avatar May 14 '25 09:05 ThreadDao

Reproduced on 2.5-20250514-813bcb14-amd64

ThreadDao avatar May 16 '25 08:05 ThreadDao

Reproduced on master-20250519-38ded736-amd64. But it has nothing to do with the chaos: the count was already wrong before the chaos started.

ThreadDao avatar May 19 '25 07:05 ThreadDao

/assign @ThreadDao

liliu-z avatar Jun 10 '25 11:06 liliu-z

/assign @ThreadDao This should be fixed, please help verify it. /unassign

chyezh avatar Jun 14 '25 12:06 chyezh

@chyezh (screenshot attached)

  • image: master-20250616-5e184417-amd64
  • argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/eb560b14-8977-45dd-96f2-c17cfdd18939?nodeId=zong-chaos-standalone-26-4

ThreadDao avatar Jun 16 '25 09:06 ThreadDao

/assign

chyezh avatar Jun 18 '25 02:06 chyezh

It may increase the count result by a huge amount. Already fixed by #42689.

/assign @ThreadDao /unassign

chyezh avatar Jun 20 '25 02:06 chyezh

I tested 6 times and didn't reproduce it; fixed on master-20250620-b043ff14-amd64

ThreadDao avatar Jun 20 '25 10:06 ThreadDao

@weiliu1031 https://argo-workflows.zilliz.cc/archived-workflows/qa/e277daf0-b244-4d6e-a98f-c9af9f9e79f2?nodeId=zong-chaos-pod-sn-1751824800-966436947

ThreadDao avatar Jul 07 '25 09:07 ThreadDao

@weiliu1031 master-20250709-7f8c5c9b-amd64

  • https://argo-workflows.zilliz.cc/archived-workflows/qa/7ec1629f-7be2-4c4d-83b3-e08c96560f0a?nodeId=zong-chaos-standalone-1752087600
  • https://argo-workflows.zilliz.cc/archived-workflows/qa/01594036-4cbc-403b-a680-63d52882e9c4?nodeId=zong-chaos-pod-dn-1752084000

ThreadDao avatar Jul 10 '25 03:07 ThreadDao

@weiliu1031

  • image: master-20250715-fe8de016-amd64
  • argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/b5cba396-50e9-4386-a4ee-527ce69900d6?nodeId=zong-chaos-standalone-1752606000
Traceback (most recent call last):
  File "/src/fouram/client/concurrent/locust_client.py", line 28, in wrapper
    result = func(*args, **kwargs)
  File "/src/fouram/client/cases/base.py", line 890, in concurrent_query
    return self.collection_wrap.query(expr=params.query_expr, **params.obj_params)
  File "/src/fouram/client/client_base/collection_wrapper.py", line 164, in query
    check_result = ResponseChecker(res, func_name, check_task, check_items, res_result, expression=expr,
  File "/src/fouram/client/check/func_check.py", line 90, in run
    result = self.check_query_output_count(self.response, self.succ, self.check_items)
  File "/src/fouram/client/check/func_check.py", line 338, in check_query_output_count
    assert int(query_count) == expected_query_count, f'{query_count} == {expected_query_count}'
AssertionError: 9997600 == 10000000
  • pods:
zong-chaos-standalone-1752606000-milvus-standalone-75bcc889k2ln   1/1     Running     0               6h43m   10.104.18.17    4am-node25   <none>           <none>
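The failing check in the traceback compares the single count(*) row against a fixed expectation (here it failed low: 9997600 vs 10000000). A minimal standalone sketch of that comparison — the helper name and response shape are assumptions based on the traceback, not the actual fouram code:

```python
def check_query_output_count(response, expected_query_count):
    """Mimics the assertion in func_check.py: the count(*) query returns a
    single row whose 'count(*)' field must equal the expected total."""
    query_count = response[0]["count(*)"]
    assert int(query_count) == expected_query_count, \
        f"{query_count} != {expected_query_count}"

# A response shaped like a pymilvus count(*) query result:
check_query_output_count([{"count(*)": 10_000_000}], 10_000_000)
```

Note the original assertion message uses `==`, which reads confusingly when the check fails (the two numbers are in fact unequal); `!=` in the message would make logs like the one above less ambiguous.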

ThreadDao avatar Jul 16 '25 02:07 ThreadDao

/assign @chyezh

chyezh avatar Jul 17 '25 12:07 chyezh

/assign @zhagnlu

chyezh avatar Aug 11 '25 12:08 chyezh

/assign @ThreadDao

yanliang567 avatar Aug 18 '25 07:08 yanliang567

not reproduced

ThreadDao avatar Aug 22 '25 03:08 ThreadDao