milvus
[Bug]: [Nightly] Milvus cluster (pulsar) run times out because pulsarv3-bookie keeps restarting
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: d35c33d
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
The Milvus cluster (pulsar) nightly run times out because the pulsarv3-bookie pods keep restarting.
mdpm-master-382-py-n-etcd-0 1/1 Running 0 4h16m
mdpm-master-382-py-n-milvus-datanode-66b9c56c9f-r9sfg 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-datanode-66b9c56c9f-vbdgq 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-mixcoord-5fdff4b856-ljx4n 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-proxy-58d44f5848-6mtxm 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-proxy-58d44f5848-rdfts 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-querynode-9b7646d6c-j65vt 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-querynode-9b7646d6c-lrkb8 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-milvus-streamingnode-7cc5ff7774-x9cjr 1/1 Running 4 (3h56m ago) 4h16m
mdpm-master-382-py-n-milvus-streamingnode-7cc5ff7774-zqt4z 1/1 Running 2 (4h16m ago) 4h16m
mdpm-master-382-py-n-minio-67dd56fbd7-ktnmz 1/1 Running 0 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-0 1/1 Running 30 (2m3s ago) 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-1 1/1 Running 31 (4m32s ago) 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-2 1/1 Running 30 (2m8s ago) 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-init-24t4d 0/1 Completed 0 4h16m
mdpm-master-382-py-n-pulsarv3-broker-0 1/1 Running 0 4h16m
mdpm-master-382-py-n-pulsarv3-broker-1 1/1 Running 0 4h16m
mdpm-master-382-py-n-pulsarv3-proxy-0 1/1 Running 0 4h16m
mdpm-master-382-py-n-pulsarv3-proxy-1 1/1 Running 0 4h16m
mdpm-master-382-py-n-pulsarv3-pulsar-init-m55wd 0/1 Completed 0 4h16m
mdpm-master-382-py-n-pulsarv3-zookeeper-0 1/1 Running 0 4h16m
The three pulsarv3-bookie pods keep restarting (30 to 31 restarts each within 4h16m).
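To spot pods like these without reading the listing by eye, the restart column can be parsed directly. This is a minimal sketch, assuming the whitespace-separated layout shown above (name, ready, status, restarts, ...); the function names and the threshold of 10 restarts are hypothetical choices for illustration, not part of the CI pipeline.

```python
# Sketch: flag pods with a high restart count in a pod listing.
# Assumes the column order shown above: NAME READY STATUS RESTARTS ... AGE.

SAMPLE = """\
mdpm-master-382-py-n-etcd-0 1/1 Running 0 4h16m
mdpm-master-382-py-n-milvus-streamingnode-7cc5ff7774-x9cjr 1/1 Running 4 (3h56m ago) 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-0 1/1 Running 30 (2m3s ago) 4h16m
mdpm-master-382-py-n-pulsarv3-bookie-1 1/1 Running 31 (4m32s ago) 4h16m
mdpm-master-382-py-n-pulsarv3-broker-0 1/1 Running 0 4h16m
"""


def restart_counts(listing: str) -> dict[str, int]:
    """Map each pod name to its restart count (4th whitespace-separated field)."""
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, headers, or anything without a numeric restart field.
        if len(fields) < 5 or not fields[3].isdigit():
            continue
        counts[fields[0]] = int(fields[3])
    return counts


def flapping_pods(listing: str, threshold: int = 10) -> list[str]:
    """Return the names of pods restarted at least `threshold` times (hypothetical cutoff)."""
    counts = restart_counts(listing)
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for name in flapping_pods(SAMPLE):
        print(name)
```

With the sample rows above, only the two bookie pods are reported; the streamingnode pod with 4 restarts stays below the cutoff.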
Expected Behavior
pass
Steps To Reproduce
Milvus Log
https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/382/pipeline/123/
Anything else?
No response
The distributed-pulsar pipeline was already about 1 hour slower than the other deploy modes before Jun 13, 2025 (about 3h 45min in total). After Jun 13, 2025, the restart count of the pulsarv3-bookie pods doubled, which may push the nightly run past its 6h timeout.
/assign @chyezh /unassign
It seems that the bookie is OOMKilled.
mdpm-master-380-py-n-pulsarv3-bookie-2 42 OOMKilled
Still working on it.
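We limit the bookie container to 2GB of memory, but BookKeeper is still running with its default JVM settings here:
-Xms4096m -Xmx4096m -XX:MaxDirectMemorySize=8192m
The JVM is therefore allowed to use far more memory than the container limit, so the bookie gets OOMKilled.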
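The mismatch can be checked with a little arithmetic. The following is a rough sketch, assuming the 2GB limit means 2 GiB and counting only the maximum heap (`-Xmx`) and maximum direct memory (`-XX:MaxDirectMemorySize`); the helper names are hypothetical, and a real JVM also uses metaspace, thread stacks, and other native memory on top of this.

```python
# Sketch: compare the memory a JVM is allowed to use against a container limit.
# Only -Xmx and -XX:MaxDirectMemorySize are counted; other native memory is ignored.
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a JVM size such as '4096m' or '8g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError(f"unrecognized JVM size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def jvm_memory_ceiling(opts: str) -> int:
    """Sum max heap and max direct memory (in bytes) from a JVM option string."""
    heap = direct = 0
    for opt in opts.split():
        if opt.startswith("-Xmx"):
            heap = parse_size(opt[len("-Xmx"):])
        elif opt.startswith("-XX:MaxDirectMemorySize="):
            direct = parse_size(opt.split("=", 1)[1])
    return heap + direct


BOOKIE_OPTS = "-Xms4096m -Xmx4096m -XX:MaxDirectMemorySize=8192m"
CONTAINER_LIMIT = 2 * 1024**3  # assumption: the 2GB limit is 2 GiB

if __name__ == "__main__":
    ceiling = jvm_memory_ceiling(BOOKIE_OPTS)
    print(f"JVM ceiling: {ceiling // 1024**2} MiB")
    print(f"Container limit: {CONTAINER_LIMIT // 1024**2} MiB")
    print("exceeds limit:", ceiling > CONTAINER_LIMIT)
```

Heap plus direct memory alone add up to 12288 MiB against a 2048 MiB limit, so the kernel can kill the container well before the JVM reaches its own ceiling.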
/assign @NicoYuan1986 It seems to be fixed after modifying the bookie configuration; please help verify it. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/383/pipeline /unassign
Thanks for the quick fix! b902960