
[Bug]: fail to search on QueryNode 219: Timestamp lag too large

Open · yesyue opened this issue 6 months ago · 3 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.4.5
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):     kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):  CentOS
- CPU/Memory: 32/128
- GPU: 
- Others:

Current Behavior

[2025/05/26 01:37:14.243 +00:00] [WARN] [kafka/kafka_consumer.go:138] ["consume msg failed"] [topic=by-dev-rootcoord-dml_5] [groupID=querynode-236-by-dev-rootcoord-dml_5_453153786713474132v0-false] [error="Local: Timed out"]

Expected Behavior

No response

Steps To Reproduce


Milvus Log

No response

Anything else?

No response

yesyue avatar May 26 '25 03:05 yesyue

We need to know what is causing the timestamp lag to grow too large. It is usually caused by heavy insert/delete traffic, or by something wrong with the datanode/indexnode or the MQ. Please attach the complete Milvus logs for investigation. If you installed Milvus with k8s, please refer to this doc to export the whole set of Milvus logs. @yesyue
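
If you want to pull the logs by hand instead of using the export script from the linked doc, here is a rough kubectl sketch; the namespace, release name, and Helm label selector below are only placeholders for a typical Helm install, so adjust them to your deployment:

```python
# Hypothetical log-export sketch: dump logs from every pod of a Helm-installed
# Milvus release. Namespace, release name, and label selector are assumptions;
# the official export script in the linked doc is the authoritative way.
import pathlib
import subprocess

NAMESPACE = "milvus"       # placeholder namespace
RELEASE = "my-release"     # placeholder Helm release name
OUT_DIR = pathlib.Path("milvus-logs")
OUT_DIR.mkdir(exist_ok=True)

# List pods belonging to the release (standard Helm label convention).
pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE,
     "-l", f"app.kubernetes.io/instance={RELEASE}", "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout.split()

for pod in pods:  # e.g. "pod/my-release-milvus-querynode-0"
    name = pod.split("/", 1)[1]
    log = subprocess.run(
        ["kubectl", "logs", pod, "-n", NAMESPACE, "--all-containers=true"],
        capture_output=True, text=True, check=True,
    ).stdout
    (OUT_DIR / f"{name}.log").write_text(log)
```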

/unassign

yanliang567 avatar May 26 '25 07:05 yanliang567

I encountered a similar error after heavy inserts (an ingestion of 220M records that had been running for at least a day): MilvusException: <MilvusException: (code=65535, message=fail to search on QueryNode 15: Timestamp lag too large)>
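
As a client-side stopgap while ingestion catches up, a minimal pymilvus sketch that retries with a relaxed consistency level when this error appears; the collection and field names are placeholders, and it assumes pymilvus 2.4.x, where search() accepts a consistency_level override:

```python
# Hypothetical mitigation sketch: retry a search with eventual consistency
# when the server reports "Timestamp lag too large". Names are placeholders.
from pymilvus import Collection, MilvusException, connections

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")  # hypothetical collection name

SEARCH_PARAMS = {"metric_type": "L2", "params": {"nprobe": 10}}

def search_with_fallback(vectors):
    try:
        # Default (stronger) consistency: the query node must catch up to the
        # guarantee timestamp, which fails while ingestion lags badly.
        return collection.search(
            data=vectors, anns_field="embedding",  # hypothetical field name
            param=SEARCH_PARAMS, limit=10,
        )
    except MilvusException as exc:
        if "Timestamp lag too large" not in str(exc):
            raise
        # Fall back to eventual consistency so the search does not wait for
        # the lagging tsafe; results may miss the most recent inserts.
        return collection.search(
            data=vectors, anns_field="embedding",
            param=SEARCH_PARAMS, limit=10,
            consistency_level="Eventually",
        )
```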

Attached are the exported logs.

milvus-log.tar.gz

jfelectron avatar Jun 15 '25 18:06 jfelectron

From the logs, there seems to be a lot going on in this cluster:

  1. We see many collection drops in this cluster; this might be expected.
  2. We see that the drops did not succeed, most likely because Pulsar is not working correctly.
  3. My suggestion is to check the health of Pulsar: check the disk on the bookies, or run bin/bookkeeper shell listbookies -ro and check whether all the bookies have been put into read-only state (a small script sketch of this check follows below).

You may need to scale Pulsar or reduce the retention time when you hit this issue; see https://milvus.io/docs/scale-dependencies.md
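
A small sketch of the bookie check from step 3, run from the Pulsar/BookKeeper install directory; the output parsing is only a rough guess, so read the raw output yourself:

```python
# Hypothetical wrapper around `bin/bookkeeper shell listbookies -ro`:
# surface any bookies reported as read-only. The "host:port" line heuristic
# is an assumption about the output format, not a documented contract.
import subprocess

result = subprocess.run(
    ["bin/bookkeeper", "shell", "listbookies", "-ro"],
    capture_output=True, text=True,
)
print(result.stdout)

readonly = [line for line in result.stdout.splitlines() if ":" in line]
if readonly:
    # If every bookie shows up here, writes to Pulsar stall, Milvus falls
    # behind the MQ, and searches fail with "Timestamp lag too large".
    print(f"Read-only bookies detected: {len(readonly)} -- check bookie disk usage")
```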


xiaofan-luan avatar Jun 15 '25 23:06 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jul 16 '25 00:07 stale[bot]