
[Bug]: [streaming] Query count(*) results in more than expected during chaos kill streamingNode container

Open ThreadDao opened this issue 7 months ago • 9 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: chyezh-enhance_make_recovery_components_full-b41ce80-20250427
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): woodpecker   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

  • streamingNode: 2 * 4c16g
  • queryNode: 2 * 4c16g
  • milvus.yaml config
  config:
    common:
      enabledJSONKeyStats: true
    dataCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    queryCoord:
      enableActiveStandby: true
    queryNode:
      mmap:
        growingMmapEnabled: true
        scalarField: true
        scalarIndex: true
        vectorField: true
        vectorIndex: true
    rootCoord:
      enableActiveStandby: true
    streaming:
      walWriteAheadBuffer:
        capacity: 1m
        keepalive: 0.5s

client test

  1. create collection fouram_jDzJ1VeB with fields: pk + vector + int64_1(partition_key) + json_1
  2. create vector index HNSW
  3. insert 10m entities -> flush -> index again -> load again
  4. concurrent requests: query count(*) + search + upsert + flush + scene_search_test
  • upsert: start pk from 0
  • scene_search_test: create collection -> index -> load -> insert 10k -> flush -> index -> load -> search -> drop collection
  5. apply chaos to kill a streamingNode container randomly every 2 minutes during a 10-minute window
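For context on why the expected count stays fixed at 10m: the upsert workload reuses pks starting from 0, so every upsert replaces an existing entity instead of adding a new one. A toy model of that invariant (pure Python, with a scaled-down entity count as an assumption; this is not the fouram client code):

```python
# Toy model: upserts that reuse existing primary keys must not change the
# visible row count. Sizes are scaled down from the real test (10m -> 10_000).
NUM_ENTITIES = 10_000  # stands in for the 10m entities in the real test

# Initial bulk insert: pk -> payload
table = {pk: {"int64_1": pk % 64} for pk in range(NUM_ENTITIES)}

def upsert(table, rows):
    """Upsert semantics: replace the row if the pk exists, insert otherwise."""
    for pk, payload in rows:
        table[pk] = payload

# The concurrent upserts start their pk range from 0, so every pk collides
# with an existing row and the total count must stay constant.
upsert(table, [(pk, {"int64_1": -1}) for pk in range(5_000)])

assert len(table) == NUM_ENTITIES  # count(*) should still report 10_000
```

Any count(*) result above 10m therefore implies duplicated entities (e.g. replayed WAL entries being applied twice), not legitimate growth.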

result

During chaos, query count(*) returns successfully, but the actual count is greater than the expected 10m (screenshot attached).

Expected Behavior

count(*) should always return 10m

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/60f7c994-12e4-49a2-a476-892c155200b3?nodeId=zong-chaos-clu-wp-sn-3-4263864101

Milvus Log

pods:

zong-chaos-clu-wp-sn-3-etcd-0                                     1/1     Running     0               13h     10.104.18.57    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-etcd-1                                     1/1     Running     0               13h     10.104.34.60    4am-node37   <none>           <none>
zong-chaos-clu-wp-sn-3-etcd-2                                     1/1     Running     0               13h     10.104.27.117   4am-node31   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-9nmzm           1/1     Running     0               13h     10.104.34.61    4am-node37   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-datanode-5747cd787d-jl529           1/1     Running     0               13h     10.104.6.98     4am-node13   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-4ssbv           1/1     Running     0               13h     10.104.18.63    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-mixcoord-7bdb75c595-5kxtc           1/1     Running     0               13h     10.104.20.155   4am-node22   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-proxy-59dcc469fd-kf5kk              1/1     Running     0               13h     10.104.14.200   4am-node18   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-488xk        1/1     Running     0               13h     10.104.18.64    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-querynode-0-746c9d75fc-f4j45        1/1     Running     0               13h     10.104.19.97    4am-node28   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-thhvq      1/1     Running     4 (12h ago)     13h     10.104.16.119   4am-node21   <none>           <none>
zong-chaos-clu-wp-sn-3-milvus-streamingnode-5f8b94c55b-zg6kt      1/1     Running     4 (12h ago)     13h     10.104.17.64    4am-node23   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-0                                    1/1     Running     0               13h     10.104.18.58    4am-node25   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-1                                    1/1     Running     0               13h     10.104.30.66    4am-node38   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-2                                    1/1     Running     0               13h     10.104.27.118   4am-node31   <none>           <none>
zong-chaos-clu-wp-sn-3-minio-3                                    1/1     Running     0               13h     10.104.24.100   4am-node29   <none>           <none>

Anything else?

No response

ThreadDao avatar Apr 28 '25 03:04 ThreadDao

May be related to the log loss issue: #41563

chyezh avatar Apr 28 '25 05:04 chyezh

Reproduced; see the argo monitor.

chyezh avatar May 07 '25 06:05 chyezh

count(*) is wrong before applying chaos

ThreadDao avatar May 14 '25 09:05 ThreadDao

Reproduced on 2.5-20250514-813bcb14-amd64

ThreadDao avatar May 16 '25 08:05 ThreadDao

Reproduced on master-20250519-38ded736-amd64. But it has nothing to do with the chaos: the count was already wrong before the chaos started.

ThreadDao avatar May 19 '25 07:05 ThreadDao

/assign @ThreadDao

liliu-z avatar Jun 10 '25 11:06 liliu-z

/assign @ThreadDao This should be fixed, please help verify it. /unassign

chyezh avatar Jun 14 '25 12:06 chyezh

@chyezh (screenshot attached)

  • image: master-20250616-5e184417-amd64
  • argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/eb560b14-8977-45dd-96f2-c17cfdd18939?nodeId=zong-chaos-standalone-26-4

ThreadDao avatar Jun 16 '25 09:06 ThreadDao

/assign

chyezh avatar Jun 18 '25 02:06 chyezh

It may increase the count result by a huge amount. Already fixed by #42689.

/assign @ThreadDao /unassign

chyezh avatar Jun 20 '25 02:06 chyezh

I tested 6 times and didn't reproduce it; fixed on master-20250620-b043ff14-amd64

ThreadDao avatar Jun 20 '25 10:06 ThreadDao

@weiliu1031 https://argo-workflows.zilliz.cc/archived-workflows/qa/e277daf0-b244-4d6e-a98f-c9af9f9e79f2?nodeId=zong-chaos-pod-sn-1751824800-966436947

ThreadDao avatar Jul 07 '25 09:07 ThreadDao

@weiliu1031 master-20250709-7f8c5c9b-amd64

  • https://argo-workflows.zilliz.cc/archived-workflows/qa/7ec1629f-7be2-4c4d-83b3-e08c96560f0a?nodeId=zong-chaos-standalone-1752087600
  • https://argo-workflows.zilliz.cc/archived-workflows/qa/01594036-4cbc-403b-a680-63d52882e9c4?nodeId=zong-chaos-pod-dn-1752084000

ThreadDao avatar Jul 10 '25 03:07 ThreadDao

@weiliu1031

  • image: master-20250715-fe8de016-amd64
  • argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/b5cba396-50e9-4386-a4ee-527ce69900d6?nodeId=zong-chaos-standalone-1752606000
Traceback (most recent call last):
  File "/src/fouram/client/concurrent/locust_client.py", line 28, in wrapper
    result = func(*args, **kwargs)
  File "/src/fouram/client/cases/base.py", line 890, in concurrent_query
    return self.collection_wrap.query(expr=params.query_expr, **params.obj_params)
  File "/src/fouram/client/client_base/collection_wrapper.py", line 164, in query
    check_result = ResponseChecker(res, func_name, check_task, check_items, res_result, expression=expr,
  File "/src/fouram/client/check/func_check.py", line 90, in run
    result = self.check_query_output_count(self.response, self.succ, self.check_items)
  File "/src/fouram/client/check/func_check.py", line 338, in check_query_output_count
    assert int(query_count) == expected_query_count, f'{query_count} == {expected_query_count}'
AssertionError: 9997600 == 10000000
  • pods:
zong-chaos-standalone-1752606000-milvus-standalone-75bcc889k2ln   1/1     Running     0               6h43m   10.104.18.17    4am-node25   <none>           <none>
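The failing check in the traceback compares the single count(*) row against a fixed expectation (here it failed low: 9997600 vs 10000000). A minimal standalone sketch of that comparison — the helper name and response shape are assumptions based on the traceback, not the actual fouram code:

```python
def check_query_output_count(response, expected_query_count):
    """Mimics the assertion in func_check.py: the count(*) query returns a
    single row whose 'count(*)' field must equal the expected total."""
    query_count = response[0]["count(*)"]
    assert int(query_count) == expected_query_count, \
        f"{query_count} != {expected_query_count}"

# A response shaped like a pymilvus count(*) query result:
check_query_output_count([{"count(*)": 10_000_000}], 10_000_000)
```

Note the original assertion message uses `==`, which reads confusingly when the check fails (the two numbers are in fact unequal); `!=` in the message would make logs like the one above less ambiguous.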

ThreadDao avatar Jul 16 '25 02:07 ThreadDao

/assign @chyezh

chyezh avatar Jul 17 '25 12:07 chyezh

/assign @zhagnlu

chyezh avatar Aug 11 '25 12:08 chyezh

/assign @ThreadDao

yanliang567 avatar Aug 18 '25 07:08 yanliang567

not reproduced

ThreadDao avatar Aug 22 '25 03:08 ThreadDao