[Bug]: [Nightly] Milvus cluster(pulsar) run timeout for pulsarv3-bookie keep restarting

Open NicoYuan1986 opened this issue 5 months ago • 5 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: d35c33d
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The Milvus cluster (pulsar) nightly run times out because the pulsarv3-bookie pods keep restarting.

mdpm-master-382-py-n-etcd-0                                  1/1     Running     0                4h16m
mdpm-master-382-py-n-milvus-datanode-66b9c56c9f-r9sfg        1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-datanode-66b9c56c9f-vbdgq        1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-mixcoord-5fdff4b856-ljx4n        1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-proxy-58d44f5848-6mtxm           1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-proxy-58d44f5848-rdfts           1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-querynode-9b7646d6c-j65vt        1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-querynode-9b7646d6c-lrkb8        1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-milvus-streamingnode-7cc5ff7774-x9cjr   1/1     Running     4 (3h56m ago)    4h16m
mdpm-master-382-py-n-milvus-streamingnode-7cc5ff7774-zqt4z   1/1     Running     2 (4h16m ago)    4h16m
mdpm-master-382-py-n-minio-67dd56fbd7-ktnmz                  1/1     Running     0                4h16m
mdpm-master-382-py-n-pulsarv3-bookie-0                       1/1     Running     30 (2m3s ago)    4h16m
mdpm-master-382-py-n-pulsarv3-bookie-1                       1/1     Running     31 (4m32s ago)   4h16m
mdpm-master-382-py-n-pulsarv3-bookie-2                       1/1     Running     30 (2m8s ago)    4h16m
mdpm-master-382-py-n-pulsarv3-bookie-init-24t4d              0/1     Completed   0                4h16m
mdpm-master-382-py-n-pulsarv3-broker-0                       1/1     Running     0                4h16m
mdpm-master-382-py-n-pulsarv3-broker-1                       1/1     Running     0                4h16m
mdpm-master-382-py-n-pulsarv3-proxy-0                        1/1     Running     0                4h16m
mdpm-master-382-py-n-pulsarv3-proxy-1                        1/1     Running     0                4h16m
mdpm-master-382-py-n-pulsarv3-pulsar-init-m55wd              0/1     Completed   0                4h16m
mdpm-master-382-py-n-pulsarv3-zookeeper-0                    1/1     Running     0                4h16m

The pulsarv3-bookie pods keep restarting.

Expected Behavior

pass

Steps To Reproduce


Milvus Log

https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/382/pipeline/123/

Anything else?

No response

NicoYuan1986 avatar Jun 16 '25 01:06 NicoYuan1986

Before Jun 13, 2025, the distributed-pulsar pipeline was already about 1 hour slower than the other deploy modes (about 3h 45min in total). After Jun 13, 2025, the restart count of the pulsarv3-bookie pods doubled, which may push the nightly run past its timeout (6h).

NicoYuan1986 avatar Jun 16 '25 01:06 NicoYuan1986

/assign @chyezh /unassign

yanliang567 avatar Jun 16 '25 02:06 yanliang567

It seems that the bookie is OOMKilled.

mdpm-master-380-py-n-pulsarv3-bookie-2   42         OOMKilled

Still working on it.

chyezh avatar Jun 16 '25 07:06 chyezh

We limit BookKeeper to 2GB of memory, but the default BookKeeper JVM setup here is:

-Xms4096m -Xmx4096m -XX:MaxDirectMemorySize=8192m

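The JVM is therefore allowed to grow well past the 2GB limit (4GB of heap plus up to 8GB of direct memory), so the bookie gets OOMKilled.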
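
To make the mismatch concrete, here is a minimal Python sketch that adds up the memory the JVM flags above permit and compares it with the container limit. The helper names `parse_size` and `jvm_memory_ceiling` are hypothetical (not part of Milvus, Pulsar, or BookKeeper), and the 2GB limit is assumed to mean 2 GiB. The sum only covers heap and direct memory, so real JVM usage (metaspace, thread stacks, etc.) could be higher still.

```python
import re

# Multipliers for JVM-style size suffixes (k, m, g); no suffix means bytes.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a JVM-style size such as '4096m' or '2g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit.lower(), 1)


def jvm_memory_ceiling(jvm_opts):
    """Return max heap plus max direct memory (in bytes) allowed by the flags."""
    total = 0
    heap = re.search(r"-Xmx(\S+)", jvm_opts)
    direct = re.search(r"-XX:MaxDirectMemorySize=(\S+)", jvm_opts)
    if heap:
        total += parse_size(heap.group(1))
    if direct:
        total += parse_size(direct.group(1))
    return total


# Flags quoted in this issue.
jvm_opts = "-Xms4096m -Xmx4096m -XX:MaxDirectMemorySize=8192m"
# Assumption: the 2GB container limit is 2 GiB.
container_limit = parse_size("2g")

ceiling = jvm_memory_ceiling(jvm_opts)
print(ceiling // 1024**2, "MiB")   # 12288 MiB
print(ceiling > container_limit)   # True
```

Under these assumptions the flags allow roughly 12 GiB against a 2 GiB limit, and since `-Xms4096m` sets the initial heap size to 4 GiB on its own, the bookie is likely to hit the limit soon after it starts taking load.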

chyezh avatar Jun 16 '25 07:06 chyezh

/assign @NicoYuan1986 It seems to be fixed after modifying the bookie configuration; please help verify it. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/383/pipeline /unassign

chyezh avatar Jun 17 '25 02:06 chyezh

Thanks for the quick fix! b902960

NicoYuan1986 avatar Jun 24 '25 01:06 NicoYuan1986