[Bug]: [Nightly] Nightly test takes more time on average than before and sometimes fails due to timeout
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 43a9e17
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.0.dev7
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Nightly tests have taken more time on average than before and sometimes fail due to timeout. Milvus does not panic. Latest: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/340/pipeline/123
Expected Behavior
work as before
Steps To Reproduce
No response
Milvus Log
artifacts-milvus-standalone-nightly-340-pymilvus-e2e-logs.tar.gz
Anything else?
No response
Other information:
The situation seems to have started on April 9th. https://jenkins.milvus.io:18080/job/Milvus%20Nightly%20CI/job/master/337/
It has been reproduced continuously since April 9th, so it is marked as urgent.
Master branch:
Nightly on commit 3c52d76 finished in 1h34m; nightly on commit d85f673 started to take 3h48m.
The commits between 3c52d76 and d85f673 are:
So @longjiquan, could you help take a look first? Thanks.
2.2.0 branch:
Nightly on commit dfee106 finished in 1h30m; nightly on commit 7e0f3e9 finished in 3h15m.
The commits between dfee106 and 7e0f3e9 are:
So judging from both the master and 2.2.0 branches, the suspicious commit may be "Change balance check interval to 10s".
@xiaofan-luan could you please take a look as well? Thanks.
I will look into it.
Update: kafka and standalone failed due to timeout. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/346/pipeline/155 Fewer than half of the cases finished within 6 hours.
Update: kafka and standalone failed due to timeout again. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/348/pipeline/155
/assign @sunby
@binbinlv @NicoYuan1986 Maybe we can change the checkInterval back to 1s and retry. This parameter controls the interval of all checkers, and increasing it may lead to slow collection loading (see the sketch below).
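For context, a minimal Go sketch of the pattern under discussion. The names (`CheckerController`, `checkInterval`, `segmentChecker`, `balanceChecker`) are assumptions for illustration, not the actual QueryCoord code: a single interval drives every checker, so raising it from 1s to 10s also delays the checks that push collection loading forward.

```go
package main

import (
	"fmt"
	"time"
)

// Checker stands in for QueryCoord's segment/channel/balance checkers.
// The real interfaces differ; this only illustrates the shared interval.
type Checker interface {
	Check()
}

type segmentChecker struct{}

func (segmentChecker) Check() { /* e.g. schedule loads for missing segments */ }

type balanceChecker struct{}

func (balanceChecker) Check() { /* e.g. decide whether to move segments */ }

// CheckerController drives every checker off one shared interval.
// With checkInterval = 10s, segment-load checks also fire at most every
// 10s, so loading a collection needs several extra ticks to converge.
type CheckerController struct {
	checkInterval time.Duration
	checkers      []Checker
}

func (c *CheckerController) Start(stop <-chan struct{}) {
	ticker := time.NewTicker(c.checkInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// All checkers run sequentially on every tick.
			for _, ck := range c.checkers {
				ck.Check()
			}
		}
	}
}

func main() {
	c := &CheckerController{
		checkInterval: 1 * time.Second, // the value proposed to restore
		checkers:      []Checker{segmentChecker{}, balanceChecker{}},
	}
	stop := make(chan struct{})
	go c.Start(stop)
	time.Sleep(3 * time.Second)
	close(stop)
	fmt.Println("controller stopped")
}
```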
/assign @binbinlv
Will verify through tonight's nightly.
@sunby
Verified: the nightly time of the master branch decreased, so this is fixed.
But the 2.2.0 branch is still not OK. I think 2.2.0 has the same issue; could you please fix it on the 2.2.0 branch as well? Thanks.
/assign @binbinlv
I think it's probably not a good idea to check every 1s if there are no events (nodes up/down), because it might take too much CPU on a cluster with many collections. Thoughts?
Checkers are executed sequentially in one goroutine, so IMHO it will not take too much CPU. Was the purpose of changing this parameter to 10s to fix a specific issue?
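To illustrate that point, a small self-contained Go sketch (mock numbers and names, not Milvus code): all checks run back-to-back in a single goroutine per tick, so the interval mainly changes how quickly the cluster reacts, not how much CPU it burns.

```go
package main

import (
	"fmt"
	"time"
)

// mockCheck stands in for one checker pass over one collection's state.
// Real checks are in-memory comparisons of target vs. current distribution.
func mockCheck(collectionID int) {
	_ = collectionID * 2 // trivial in-memory work
}

func main() {
	const collections = 1000
	interval := 1 * time.Second // the interval being debated

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for tick := 0; tick < 3; tick++ {
		<-ticker.C
		start := time.Now()
		// All checks run back-to-back in this single goroutine.
		for id := 0; id < collections; id++ {
			mockCheck(id)
		}
		fmt.Printf("tick %d: checked %d collections in %v\n",
			tick, collections, time.Since(start))
	}
}
```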
No, just because logs were printed too often and users saw high CPU usage on 2c4g machines.
See also #23870. This shall be fixed by #23925 and #23928. @binbinlv
The nightly time for master and 2.2.* decreased back to the previous normal level; fixed and closed. milvus: 946ddc7 (2.2.*)