[Bug]: Channel checkpoint lag keeps increasing and cluster down with no way to restore

Open Xinyi7 opened this issue 5 months ago • 15 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version:2.6.4
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory (per component):
  - dataNode: 10 replicas; requests CPU 14, memory 5Gi; limits CPU 14, memory 5Gi
  - mixCoord (coordinator): 2 replicas; requests CPU 2, memory 8Gi; limits CPU 3, memory 8Gi
  - proxy: 3 replicas; requests CPU 2, memory 2Gi; limits CPU 3, memory 2Gi
  - queryNode: 10 replicas; requests CPU 32, memory 32Gi; limits CPU 32, memory 32Gi
  - streamingNode: 8 replicas; requests CPU 16, memory 8Gi; limits CPU 16, memory 8Gi

- GPU: 
- Others:

Current Behavior

We started seeing this issue on v2.6 clusters: the cluster gets into a state where it cannot serve any traffic, read or write.

Requests to create a collection also time out.

We tried the following, but the cluster did not recover:

  • restarting the pods
  • deleting the collections

Expected Behavior

The cluster should not go down, and there should be a way to restore it.

Steps To Reproduce


Milvus Log

We see that the channel checkpoint (cp) lag for one channel keeps increasing:

Image

We got these errors from the streaming node:

[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2587]

[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2588]

%4|1761844019.849|OFFSET|rdkafka#consumer-1297| [thrd:main]: posts-red-b-rootcoord-dml_11 [0]: offset reset (at offset 0, broker 4) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
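For anyone triaging a larger log dump with the same symptom, here is a minimal sketch (not part of Milvus) that groups the "remove wal failed" warnings by channel and lists the terms reported as expired. The function name `summarize_expired_terms` is hypothetical, and the parsing assumes the bracketed `[key=value]` layout seen in the two warnings above; the sample text is copied from this issue.

```python
import re
from collections import defaultdict

# Matches bracketed fields such as [channel=foo] or [term=2587].
# Quoted values (e.g. [error="..."]) are intentionally skipped.
FIELD_RE = re.compile(r"\[(\w+)=([^\]\s\"]+)\]")

# Zero-width split point in front of each "[YYYY/MM/DD " timestamp, so that
# entries pasted onto a single line are still separated.
ENTRY_START_RE = re.compile(r"(?=\[\d{4}/\d{2}/\d{2} )")

# The two warnings from this issue, pasted on one line as in the report.
SAMPLE = '''[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2587] [2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2588]'''


def summarize_expired_terms(log_text):
    """Return {channel: sorted terms} for every 'remove wal failed' entry."""
    terms_by_channel = defaultdict(set)
    for entry in ENTRY_START_RE.split(log_text):
        if "remove wal failed" not in entry:
            continue
        fields = dict(FIELD_RE.findall(entry))
        if "channel" in fields and "term" in fields:
            terms_by_channel[fields["channel"]].add(int(fields["term"]))
    return {channel: sorted(terms) for channel, terms in terms_by_channel.items()}


if __name__ == "__main__":
    print(summarize_expired_terms(SAMPLE))
    # {'posts-red-b-rootcoord-dml_11': [2587, 2588]}
```

A channel whose list of expired terms keeps growing over time would be the first place to compare against the checkpoint lag graph.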

Anything else?

No response

Xinyi7, Oct 31 '25 21:10