[Bug]: After upgrading from 2.5 to master, the old queryNode cannot exit

Open ThreadDao opened this issue 5 months ago • 4 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.5-20250613-5110130b-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

  • config
    common:
      enabledGrowingSegmentJSONKeyStats: true
      enabledJsonKeyStats: true
      enabledOptimizeExpr: false
    dataCoord:
      enableActiveStandby: true
      enabledJSONKeyStatsInSort: false
      jsonStatsTriggerCount: 10
      jsonStatsTriggerInterval: 10
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    queryCoord:
      enableActiveStandby: true
    rootCoord:
      enableActiveStandby: true

upgrade image from 2.5 to master

2.5-20250613-5110130b-amd64 -> master-20250613-1bf960b1-amd64: the old queryNode cannot exit

zong-roll-ddl-3-milvus-datacoord-7579d84648-xj9zz                 1/1     Running       0               100m    10.104.32.122   4am-node39   <none>           <none>
zong-roll-ddl-3-milvus-datanode-8d5566789-hj5x2                   1/1     Running       0               100m    10.104.13.132   4am-node16   <none>           <none>
zong-roll-ddl-3-milvus-datanode-8d5566789-r5c24                   1/1     Running       0               100m    10.104.32.123   4am-node39   <none>           <none>
zong-roll-ddl-3-milvus-indexcoord-944d87cf4-2gdb7                 1/1     Running       0               100m    10.104.13.131   4am-node16   <none>           <none>
zong-roll-ddl-3-milvus-indexnode-7876459657-746rd                 1/1     Running       0               100m    10.104.33.67    4am-node36   <none>           <none>
zong-roll-ddl-3-milvus-indexnode-7876459657-cwljx                 1/1     Running       0               100m    10.104.15.20    4am-node20   <none>           <none>
zong-roll-ddl-3-milvus-mixcoord-5845dc46bd-vqw4z                  1/1     Running       0               37m     10.104.6.126    4am-node13   <none>           <none>
zong-roll-ddl-3-milvus-proxy-64d4f6f4c7-xpjjs                     1/1     Running       0               100m    10.104.27.203   4am-node31   <none>           <none>
zong-roll-ddl-3-milvus-querycoord-6774648bdb-5c4m4                1/1     Running       0               100m    10.104.33.66    4am-node36   <none>           <none>
zong-roll-ddl-3-milvus-querynode-0-66c498649f-jrlwb               1/1     Terminating   0               100m    10.104.6.112    4am-node13   <none>           <none>
zong-roll-ddl-3-milvus-querynode-1-6c9c986cd5-4jp8w               1/1     Running       0               36m     10.104.15.93    4am-node20   <none>           <none>
zong-roll-ddl-3-milvus-querynode-1-6c9c986cd5-v8msc               1/1     Running       0               4m36s   10.104.6.131    4am-node13   <none>           <none>
zong-roll-ddl-3-milvus-rootcoord-5665b9858-rprln                  1/1     Running       0               100m    10.104.6.111    4am-node13   <none>           <none>
zong-roll-ddl-3-milvus-streamingnode-c657f9dc8-xskfr              1/1     Running       0               37m     10.104.6.127    4am-node13   <none>           <none>
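
To spot the stuck pod in a listing like the one above without reading it by eye, a minimal Python sketch follows. It assumes the text comes from something like `kubectl get pods -o wide`, that the pod name is the first whitespace-separated column and the status is the third, and that an optional header row starts with `NAME`. The helper names `parse_pod_listing` and `terminating_pods` are hypothetical, not part of any Milvus or Kubernetes tooling.

```python
def parse_pod_listing(listing):
    """Return (pod_name, status) pairs from a whitespace-separated pod listing.

    Assumption: column 1 is the pod name and column 3 is the status.
    Blank lines, short lines, and a header row starting with NAME are skipped.
    """
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        pods.append((fields[0], fields[2]))
    return pods


def terminating_pods(listing):
    """Return the names of pods whose status column reads Terminating."""
    return [name for name, status in parse_pod_listing(listing) if status == "Terminating"]


if __name__ == "__main__":
    sample = """
zong-roll-ddl-3-milvus-querycoord-6774648bdb-5c4m4   1/1  Running      0  100m
zong-roll-ddl-3-milvus-querynode-0-66c498649f-jrlwb  1/1  Terminating  0  100m
zong-roll-ddl-3-milvus-querynode-1-6c9c986cd5-4jp8w  1/1  Running      0  36m
"""
    for name in terminating_pods(sample):
        print(name)
```

This only reads a snapshot; it says nothing about how long a pod has been terminating or why.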

Expected Behavior

No response

Steps To Reproduce

argo workflow: zong-roll-ddl-3

Milvus Log

No response

Anything else?

No response

ThreadDao avatar Jun 13 '25 11:06 ThreadDao

/unassign

yanliang567 avatar Jun 14 '25 01:06 yanliang567

The mixcoord does not start up while the old distributed coordinators are still running. The querynode is in the middle of a rolling update, and the old querycoord cannot move its segments and channels. We need to support a rolling update from the old distributed coordinators to the new mixcoord before the query nodes start rolling.

/assign @AlintaLu

chyezh avatar Jun 16 '25 02:06 chyezh

@chyezh: GitHub didn't allow me to assign the following users: AlintaLu.

Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

sre-ci-robot avatar Jun 16 '25 02:06 sre-ci-robot

As confirmed by @haorenfsa, we only support upgrading a 2.5 cluster with mixcoord to 2.6. If a user wants to upgrade a 2.5 cluster with distributed coordinators to 2.6, they need to switch the distributed coordinators to mixcoord while still on 2.5.

/assign @ThreadDao

So we only need to verify the cluster with mixcoord at 2.5.

/unassign
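
The precondition above (the 2.5 cluster runs mixcoord only, with no distributed coordinators) could be checked before an upgrade with a small sketch like the following. It assumes pods are named `<release>-milvus-<component>-...`, as in the pod listing in this issue; the function names `coordinator_kinds` and `uses_only_mixcoord` are hypothetical and not part of Milvus, the operator, or helm.

```python
import re

# Assumption: the component name follows "-milvus-" in the pod name,
# as in the pod listing from this issue.
_COORD_PATTERN = re.compile(
    r"-milvus-(rootcoord|datacoord|querycoord|indexcoord|mixcoord)-"
)


def coordinator_kinds(pod_names):
    """Return the set of coordinator component names found in the pod names."""
    kinds = set()
    for name in pod_names:
        match = _COORD_PATTERN.search(name)
        if match:
            kinds.add(match.group(1))
    return kinds


def uses_only_mixcoord(pod_names):
    """True when mixcoord is the only coordinator component present."""
    return coordinator_kinds(pod_names) == {"mixcoord"}


if __name__ == "__main__":
    pods = [
        "zong-roll-ddl-3-milvus-rootcoord-5665b9858-rprln",
        "zong-roll-ddl-3-milvus-mixcoord-5845dc46bd-vqw4z",
        "zong-roll-ddl-3-milvus-querynode-1-6c9c986cd5-4jp8w",
    ]
    if not uses_only_mixcoord(pods):
        print("distributed coordinators still present:", sorted(coordinator_kinds(pods)))
```

A name-based check is only a heuristic; the deployment's actual configuration remains the source of truth.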

chyezh avatar Jun 16 '25 06:06 chyezh

@haorenfsa @LoveEachDay

This might be an issue that needs to be solved by the operator and helm?

xiaofan-luan avatar Jun 22 '25 22:06 xiaofan-luan

Yes, will do later. For now we have provided a doc on removing the other coords with the operator: https://milvus.io/docs/upgrade_milvus_cluster-operator.md#Upgrade-Milvus-Cluster-with-Milvus-Operator

haorenfsa avatar Jun 23 '25 17:06 haorenfsa

I will test with a 2.5 Milvus image in which the multiple coords are combined into one mixCoord.

ThreadDao avatar Jun 26 '25 08:06 ThreadDao