[Bug]: Channel checkpoint lag keeps increasing and cluster goes down with no way to restore it
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 2.6.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
dataNode
Replicas: 10
Requests: CPU '14', Memory '5Gi'
Limits: CPU '14', Memory '5Gi'
coordinator mixCoord
Replicas: 2
Requests: CPU '2', Memory '8Gi'
Limits: CPU '3', Memory '8Gi'
🌐 proxy
Replicas: 3
Requests: CPU '2', Memory '2Gi'
Limits: CPU '3', Memory '2Gi'
🔍 queryNode
Replicas: 10
Requests: CPU '32', Memory '32Gi'
Limits: CPU '32', Memory '32Gi'
🌊 streamingNode
Replicas: 8
Requests: CPU '16', Memory '8Gi'
Limits: CPU '16', Memory '8Gi'
- GPU:
- Others:
Current Behavior
We started seeing this issue on v2.6 clusters: the cluster gets into a state where it cannot serve any traffic, neither reads nor writes.
Requests to create a collection also time out.
We tried the following, but the cluster did not recover:
- restarting the pods
- deleting the collections
Expected Behavior
The cluster should not go down, and there should be a way to restore it if it does.
Steps To Reproduce
Milvus Log
We see that the channel checkpoint (cp) lag for one channel keeps increasing.
We got these errors from the streaming node:
[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2587]
[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] ["remove wal failed"] [module=streamingnode] [component=wal-manager] [error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, cannot change expected state for remove"] [channel=posts-red-b-rootcoord-dml_11] [term=2588]
%4|1761844019.849|OFFSET|rdkafka#consumer-1297| [thrd:main]: posts-red-b-rootcoord-dml_11 [0]: offset reset (at offset 0, broker 4) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
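To check whether other channels or terms are affected across a larger log file, the bracketed fields in the streaming node warnings above can be tallied with a short script. This is only a minimal sketch: the helper names `parse_fields` and `expired_terms` are hypothetical, and the `[key=value]` field layout is inferred from the two warning lines quoted in this report, not from any Milvus log specification.

```python
import re

# Matches one bracketed field of the form [key=value] or [key="quoted value"].
# Assumption: this layout is inferred from the two warnings quoted in the report.
FIELD_RE = re.compile(r'\[(\w+)=("(?:[^"\\]|\\.)*"|[^\]]*)\]')


def parse_fields(line):
    """Return the key=value fields found in one log line as a dict."""
    fields = {}
    for key, value in FIELD_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def expired_terms(lines):
    """Group the terms reported as expired by channel name."""
    by_channel = {}
    for line in lines:
        fields = parse_fields(line)
        if "expired term" not in fields.get("error", ""):
            continue
        if "channel" not in fields or "term" not in fields:
            continue
        by_channel.setdefault(fields["channel"], []).append(int(fields["term"]))
    return by_channel


# The two warnings from this report, copied verbatim.
SAMPLE = [
    '[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] '
    '["remove wal failed"] [module=streamingnode] [component=wal-manager] '
    '[error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2587, '
    'cannot change expected state for remove"] '
    '[channel=posts-red-b-rootcoord-dml_11] [term=2587]',
    '[2025/10/30 17:06:59.824 +00:00] [WARN] [walmanager/manager_impl.go:88] '
    '["remove wal failed"] [module=streamingnode] [component=wal-manager] '
    '[error="code: STREAMING_CODE_IGNORED_OPERATION, cause: expired term 2588, '
    'cannot change expected state for remove"] '
    '[channel=posts-red-b-rootcoord-dml_11] [term=2588]',
]

if __name__ == "__main__":
    for channel, terms in expired_terms(SAMPLE).items():
        print(channel, terms)
```

Run against the full streaming node log (one entry per line), this would show whether the expired-term warnings are limited to `posts-red-b-rootcoord-dml_11`, the same channel named in the Kafka "Offset out of range" message, or appear on other channels too.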
Anything else?
No response