[Bug]: Channel checkpoint lag keeps increasing and cluster down with no way to restore

Open Xinyi7 opened this issue 5 months ago • 15 comments

Is there an existing issue for this?

[x] I have searched the existing issues

Environment

- Milvus version: 2.6.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
  - dataNode: 10 replicas; requests CPU '14', memory '5Gi'; limits CPU '14', memory '5Gi'
  - mixCoord (coordinator): 2 replicas; requests CPU '2', memory '8Gi'; limits CPU '3', memory '8Gi'
  - proxy: 3 replicas; requests CPU '2', memory '2Gi'; limits CPU '3', memory '2Gi'
  - queryNode: 10 replicas; requests CPU '32', memory '32Gi'; limits CPU '32', memory '32Gi'
  - streamingNode: 8 replicas; requests CPU '16', memory '8Gi'; limits CPU '16', memory '8Gi'
- GPU: 
- Others:

Current Behavior

We started seeing this issue on v2.6 clusters: the cluster gets into a state where it cannot serve any traffic, neither reads nor writes.

Requests to create a collection also time out.

We tried the following, but the cluster did not recover:

- restarting the pods
- deleting the collections

Expected Behavior

The cluster should not go down, and there should be a way to restore it.

Steps To Reproduce

Milvus Log

We see that the channel checkpoint (cp) lag for one channel keeps increasing.

We got these errors from the streaming node:

[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2587]

[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2588]

%4|1761844019.849|OFFSET|rdkafka#consumer-1297| [thrd:main]: posts-red-b-rootcoord-dml_11 [0]: offset reset (at offset 0, broker 4) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
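
For anyone triaging similar output, here is a minimal Python sketch that puts the two kinds of lines above on one timeline. The regular expressions and the helper names `parse_line` and `parse_lines` are my own and hypothetical; they are only written to match the three lines quoted in this report and are not an official Milvus or librdkafka log format. The sketch also assumes the second field of the librdkafka line is a Unix epoch timestamp in seconds, and that the Kafka topic name is the same as the Milvus channel name, as it appears to be here.

```python
import re
from datetime import datetime, timezone

# The three lines quoted in this report, copied verbatim.
LOG_LINES = [
    '[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2587]',
    '[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2588]',
    '%4|1761844019.849|OFFSET|rdkafka#consumer-1297| [thrd:main]: posts-red-b-rootcoord-dml_11 [0]: offset reset (at offset 0, broker 4) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range',
]

# Assumed shape of the Milvus lines: [timestamp] [LEVEL] [file:line] ["message"] ... [channel=...] [term=...]
MILVUS_RE = re.compile(
    r'^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+ [+-]\d{2}:\d{2})\] '
    r'\[(?P<level>\w+)\] .*?\["(?P<msg>[^"]*)"\].*'
    r'\[channel=(?P<channel>[^\]]+)\] \[term=(?P<term>\d+)\]'
)

# Assumed shape of the librdkafka line: %severity|epoch|FACILITY|client| [thrd:name]: topic [partition]: message
RDKAFKA_RE = re.compile(
    r'^%(?P<sev>\d)\|(?P<epoch>\d+\.\d+)\|(?P<fac>\w+)\|[^|]*\| '
    r'\[thrd:[^\]]*\]: (?P<topic>\S+) \[(?P<partition>\d+)\]: (?P<msg>.*)$'
)


def parse_line(line):
    """Return a dict describing one log line, or None if neither pattern matches."""
    m = MILVUS_RE.match(line)
    if m:
        return {
            "source": "milvus",
            "time": datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S.%f %z"),
            "channel": m["channel"],
            "term": int(m["term"]),
            "message": m["msg"],
        }
    m = RDKAFKA_RE.match(line)
    if m:
        return {
            "source": "rdkafka",
            # Assumption: the second field is seconds since the Unix epoch.
            "time": datetime.fromtimestamp(float(m["epoch"]), tz=timezone.utc),
            # Assumption: the Kafka topic name equals the Milvus channel name.
            "channel": m["topic"],
            "term": None,
            "message": m["msg"],
        }
    return None


def parse_lines(lines):
    """Parse all recognizable lines and order them by timestamp."""
    events = [e for e in map(parse_line, lines) if e is not None]
    return sorted(events, key=lambda e: e["time"])


for event in parse_lines(LOG_LINES):
    print(event["time"].isoformat(), event["source"], event["channel"],
          event["term"], event["message"][:40])
```

Under those assumptions, the librdkafka timestamp converts to the same second as the two `remove wal failed` warnings, with the offset reset logged roughly 25 ms after them on the same channel. That only shows the events are adjacent in time; it does not by itself establish which one caused the other.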

Anything else?

No response

Oct 31 '25 21:10 Xinyi7