[Bug]: [Nightly]Nightly test has taken more time on average than before and sometimes failed for timeout

Open NicoYuan1986 opened this issue 1 year ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 43a9e17
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):    rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.0.dev7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The nightly test now takes more time on average than before and sometimes fails with a timeout. Milvus does not panic. Latest: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/340/pipeline/123

Expected Behavior

work as before

Steps To Reproduce

No response

Milvus Log

artifacts-milvus-standalone-nightly-340-pymilvus-e2e-logs.tar.gz

Anything else?

No response

NicoYuan1986 avatar Apr 13 '23 03:04 NicoYuan1986

Other information: img_v2_e1cd3d34-c07e-4b23-b835-fe1bc798a06g

NicoYuan1986 avatar Apr 13 '23 03:04 NicoYuan1986

The situation seems to start on April 9th. https://jenkins.milvus.io:18080/job/Milvus%20Nightly%20CI/job/master/337/

NicoYuan1986 avatar Apr 13 '23 03:04 NicoYuan1986

It has reproduced continuously since 4/9, so I set it as urgent.

binbinlv avatar Apr 13 '23 06:04 binbinlv

Master branch:

Nightly on commit 3c52d76 finished in 1h34m. Nightly on commit d85f673 took 3h48m.

The commits between 3c52d76 and d85f673 are: uSNzmXb1cF

So @longjiquan, could you take a look first? Thanks.

binbinlv avatar Apr 13 '23 06:04 binbinlv

2.2.0 branch:

Nightly on commit dfee106 finished in 1h30m. Nightly on commit 7e0f3e9 finished in 3h15m.

The commits between dfee106 and 7e0f3e9 are: pdxh8ElIwK

binbinlv avatar Apr 13 '23 07:04 binbinlv

So, judging from both the master and 2.2.0 branches, the suspicious commit may be "Change balance check interval to 10s".

@xiaofan-luan could you please have a look too? Thanks.

binbinlv avatar Apr 13 '23 07:04 binbinlv

I will look into it.

xiaofan-luan avatar Apr 17 '23 22:04 xiaofan-luan

Update: the kafka and standalone runs failed with a timeout. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/346/pipeline/155 Fewer than half of the cases finished within 6 hours.

NicoYuan1986 avatar Apr 19 '23 03:04 NicoYuan1986

Update: the kafka and standalone runs failed with a timeout again. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/348/pipeline/155

NicoYuan1986 avatar Apr 21 '23 02:04 NicoYuan1986

/assign @sunby

jiaoew1991 avatar Apr 25 '23 08:04 jiaoew1991

@binbinlv @NicoYuan1986 Maybe we can change checkInterval back to 1s and retry. This parameter controls the interval of all checkers, so the larger value may slow down collection loading.

sunby avatar Apr 25 '23 08:04 sunby

/assign @binbinlv

sunby avatar Apr 27 '23 08:04 sunby

Will verify through tonight's nightly.

binbinlv avatar Apr 27 '23 08:04 binbinlv

@sunby

Verified: the nightly time on the master branch decreased, so it is fixed there.

But the 2.2.0 branch is still not OK. I think 2.2.0 has the same issue; could you please fix it on the 2.2.0 branch too? Thanks.

binbinlv avatar Apr 28 '23 02:04 binbinlv

/assign @binbinlv

I think it's probably not a good idea to check every 1s if there are no events (nodes up/down), because it might take too much CPU on a cluster with many collections. Thoughts?

xiaofan-luan avatar Apr 28 '23 02:04 xiaofan-luan

Checkers are executed sequentially in one goroutine, so IMHO it will not take too much CPU. Was this parameter changed to 10s to fix a specific issue?

sunby avatar Apr 28 '23 03:04 sunby

No, it was only because logs were printed too often and users saw high CPU usage on 2c4g machines.

xiaofan-luan avatar Apr 28 '23 04:04 xiaofan-luan

See also #23870. This should be fixed by #23925 and #23928. @binbinlv

congqixia avatar May 08 '23 07:05 congqixia

The nightly time for master and 2.2.* is back to the previous normal, so this is fixed and closed. milvus: 946ddc7 (2.2.*)

binbinlv avatar May 09 '23 02:05 binbinlv