milvus [Bug]: [Nightly] Nightly tests take longer on average than before and sometimes fail with a timeout

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 43a9e17
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Nightly tests have been taking longer on average than before and sometimes fail with a timeout. Milvus does not panic. Latest: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/340/pipeline/123

Expected Behavior

Nightly tests finish within the usual time, as before.

Steps To Reproduce

No response

Milvus Log

artifacts-milvus-standalone-nightly-340-pymilvus-e2e-logs.tar.gz

Anything else?

No response

Apr 13 '23 03:04 NicoYuan1986

Other information: [screenshot: img_v2_e1cd3d34-c07e-4b23-b835-fe1bc798a06g]

Apr 13 '23 03:04 NicoYuan1986

The slowdown seems to have started on April 9th. https://jenkins.milvus.io:18080/job/Milvus%20Nightly%20CI/job/master/337/

Apr 13 '23 03:04 NicoYuan1986

It has reproduced continuously since 4/9, so I am setting it as urgent.

Apr 13 '23 06:04 binbinlv

Master branch:

Nightly on commit 3c52d76 finished in 1h34m. Nightly on commit d85f673 took 3h48m.

The commits between 3c52d76 and d85f673 are: uSNzmXb1cF

@longjiquan, could you take a look first? Thanks.

Apr 13 '23 06:04 binbinlv

2.2.0 branch:

Nightly on commit dfee106 finished in 1h30m. Nightly on commit 7e0f3e9 took 3h15m.

The commits between dfee106 and 7e0f3e9 are: pdxh8ElIwK
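For a sense of scale, the durations quoted for master and 2.2.0 can be compared with a short Go sketch. The `slowdown` helper is hypothetical and only does arithmetic on the four run times reported in this thread; it says nothing about the cause.

```go
package main

import (
	"fmt"
	"time"
)

// slowdown returns how many times longer `after` is than `before`.
// Both arguments use Go duration syntax, e.g. "1h34m".
func slowdown(before, after string) (float64, error) {
	b, err := time.ParseDuration(before)
	if err != nil {
		return 0, err
	}
	a, err := time.ParseDuration(after)
	if err != nil {
		return 0, err
	}
	return a.Minutes() / b.Minutes(), nil
}

func main() {
	// Run times as reported in the comments above.
	runs := []struct{ branch, before, after string }{
		{"master", "1h34m", "3h48m"},
		{"2.2.0", "1h30m", "3h15m"},
	}
	for _, r := range runs {
		ratio, err := slowdown(r.before, r.after)
		if err != nil {
			fmt.Println("bad duration:", err)
			continue
		}
		fmt.Printf("%s: %.2fx slower\n", r.branch, ratio)
	}
}
```

Both branches come out at a bit more than twice the earlier run time, which is consistent with a single change landing on both.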

Apr 13 '23 07:04 binbinlv

So, comparing the master and 2.2.0 branches, the suspicious commit may be "Change balance check interval to 10s".

@xiaofan-luan, could you please have a look as well? Thanks.

Apr 13 '23 07:04 binbinlv

I will look into it.

Apr 17 '23 22:04 xiaofan-luan

Update: the kafka and standalone runs failed with a timeout. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/346/pipeline/155 Fewer than half of the cases finished within 6 hours.

Apr 19 '23 03:04 NicoYuan1986

Update: the kafka and standalone runs failed with a timeout. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/348/pipeline/155

Apr 21 '23 02:04 NicoYuan1986

/assign @sunby

Apr 25 '23 08:04 jiaoew1991

@binbinlv @NicoYuan1986 Maybe we can change checkInterval back to 1s and retry. This parameter controls the interval of all checkers, so a larger value may lead to slow collection loading.

Apr 25 '23 08:04 sunby

/assign @binbinlv

Apr 27 '23 08:04 sunby

Will verify through tonight's nightly.

Apr 27 '23 08:04 binbinlv

@sunby

Verified: the nightly run time on the master branch has decreased, so it is fixed there.

But the 2.2.0 branch is still not OK. I think 2.2.0 has the same issue. Could you please fix it on the 2.2.0 branch as well? Thanks.

Apr 28 '23 02:04 binbinlv

/assign @binbinlv

I think it's probably not a good idea to check every 1s when there are no events (nodes up/down), because it might take too much CPU on a cluster with many collections. Thoughts?

Apr 28 '23 02:04 xiaofan-luan

The checkers are executed sequentially in one goroutine, so IMHO it will not take too much CPU. Was the parameter changed to 10s to fix a specific issue?

Apr 28 '23 03:04 sunby

No, it was only because logs were printed too often and users saw high CPU usage on 2c4g machines.

Apr 28 '23 04:04 xiaofan-luan

See also #23870. This should be fixed by #23925 and #23928. @binbinlv

May 08 '23 07:05 congqixia

The nightly run time for master and 2.2.* has returned to its previous normal level, so this is fixed and I am closing the issue. milvus: 946ddc7 (2.2.*)

May 09 '23 02:05 binbinlv
