
[Question]: Standalone launches 800+ threads, is there a thread leak?

Open cydrain opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master (commit 9621b661)
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The DevOps team found 7k+ threads in one query node pod.

Is there a thread leak in Milvus?

Expected Behavior

If there is a thread leak, we should fix it. If it's expected behavior, please provide an explanation.

Steps To Reproduce

Tested on my Ubuntu 18.04 machine:

1. Launch standalone; there are only 33 threads in this process:

❯ ps -ef | grep milvus
496:caiyd    16695     1  8 21:15 pts/5    00:00:03 ./bin/milvus run standalone
500:caiyd    17615 20752  0 21:16 pts/6    00:00:00 grep -nE --color=auto milvus
❯ cat /proc/16695/status | grep Thread
35:Threads:	33
2. Run pytest and observe that the thread count keeps increasing, stopping at 805:
❯ pytest -n 8 ./testcases/test_search.py   
... ...
❯ cat /proc/16695/status | grep Thread
35:Threads:	225
❯ cat /proc/16695/status | grep Thread
35:Threads:	236
❯ cat /proc/16695/status | grep Thread
35:Threads:	258
❯ cat /proc/16695/status | grep Thread
35:Threads:	726
❯ cat /proc/16695/status | grep Thread
35:Threads:	726
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
3. Wait for pytest to finish; the thread count stays at 805:
-- Docs: https://docs.pytest.org/en/stable/warnings.html
generated html file: file:///tmp/ci_logs/report.html

Results (1625.32s):
    2716 passed
      42 xpassed
      40 xfailed
      62 skipped
[get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs/
[create_path] folder(/tmp/ci_logs/) is not exist.
[create_path] create path now...
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
4. Repeat steps 2~3; the thread count increases to 827:
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
❯ cat /proc/16695/status | grep Thread
35:Threads:	827
❯ cat /proc/16695/status | grep Thread
35:Threads:	827


### Milvus Log

_No response_

### Anything else?

_No response_

cydrain avatar Jul 29 '22 02:07 cydrain

/assign @congqixia

cydrain avatar Jul 29 '22 02:07 cydrain

After discussion with @longjiquan, the preliminary conclusion is that two parts of the implementation cause this symptom:

  • When doing segment.Search in QueryNode, Milvus dispatches a new goroutine for each cgo call. This causes the thread count to grow up to the maximum number of segments searched concurrently (see the sketch after this list).
  • In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.
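
For illustration, here is a minimal, self-contained Go sketch (not Milvus code; the C `blocking_call` below is a stand-in for a segcore search reached via cgo) showing why one goroutine per blocking cgo call inflates the OS thread count: each call pins its thread, so the Go runtime keeps spawning new ones.

```go
package main

/*
#include <unistd.h>
// Stand-in for a blocking segcore search reached via cgo.
static void blocking_call() { sleep(2); }
*/
import "C"

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	// One goroutine per "segment": every blocking cgo call pins an OS thread,
	// so the runtime must create a new thread for each concurrent call.
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			C.blocking_call()
		}()
	}
	time.Sleep(time.Second) // let the calls pile up
	fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```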

Here are some suggestions we could try:

  • Add a cgo worker pool sized at CPU count * a fixed ratio (the ratio should be derived from performance testing); a sketch follows this list.
  • Control the concurrency in segcore/knowhere and add a mechanism to recycle extra threads if needed.
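
A minimal sketch of the worker-pool idea, assuming a semaphore sized at GOMAXPROCS * ratio; names like `cgoPool` are illustrative, not the actual Milvus implementation:

```go
package main

import (
	"runtime"
	"sync"
)

// cgoPool caps how many goroutines may enter cgo at once, so the number of
// OS threads pinned by blocking segcore calls stays bounded.
type cgoPool struct {
	sem chan struct{}
}

func newCgoPool(ratio int) *cgoPool {
	return &cgoPool{sem: make(chan struct{}, runtime.GOMAXPROCS(0)*ratio)}
}

// Do runs fn (a wrapper around a blocking cgo call) once a slot is free.
func (p *cgoPool) Do(fn func()) {
	p.sem <- struct{}{}        // acquire a slot
	defer func() { <-p.sem }() // release it when the cgo call returns
	fn()
}

func main() {
	pool := newCgoPool(1) // ratio of 1 => at most GOMAXPROCS concurrent cgo calls
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			pool.Do(func() { /* e.g. the cgo search on one segment */ })
		}()
	}
	wg.Wait()
}
```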

@czs007 @cydrain @longjiquan @xiaofan-luan WTYGT?

congqixia avatar Jul 29 '22 04:07 congqixia

> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.

Is that true? Does that make sense? @cydrain

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

  1. For a quick fix, if we remove segment-level parallelization, does it help?

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

In the future:

  1. We only support request-level parallelism (controlled by the segcore scheduler) and data-level parallelism (controlled by knowhere).
  2. Should we remove OMP and change to a global thread pool to avoid repeated thread creation?

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

From the Knowhere side, we can consider force-setting the OMP thread count in the wrapper layer to ensure consistent performance across different kinds of indexes. This count can be controlled by segcore.
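
As a hedged illustration of that idea (not Knowhere's actual wrapper API), a caller-side Go/cgo sketch that caps the OMP thread count before any parallel region runs:

```go
package main

/*
#cgo CFLAGS: -fopenmp
#cgo LDFLAGS: -fopenmp
#include <omp.h>

// Illustrative wrapper: the caller (segcore, in this proposal) fixes the OMP
// thread budget before any index code opens a parallel region.
static void set_omp_threads(int n) { omp_set_num_threads(n); }

static int parallel_width(void) {
    int threads = 0;
    #pragma omp parallel
    {
        #pragma omp master
        threads = omp_get_num_threads();
    }
    return threads;
}
*/
import "C"

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap OMP parallelism at the CPU count visible to the Go runtime.
	C.set_omp_threads(C.int(runtime.GOMAXPROCS(0)))
	fmt.Println("OMP threads per parallel region:", int(C.parallel_width()))
}
```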

liliu-z avatar Jul 29 '22 06:07 liliu-z

Submitted a quick fix to limit cgo call concurrency with GOMAXPROCS (CPU count): #18461

congqixia avatar Jul 29 '22 08:07 congqixia

> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.
>
> Is that true? Does that make sense? @cydrain

Faiss uses lots of "omp parallel" in its code. Threads created by OMP are destroyed after execution, so the leftover threads in Milvus cannot be caused by knowhere.

cydrain avatar Jul 30 '22 03:07 cydrain

@congqixia was this fixed?

yanliang567 avatar Aug 23 '22 01:08 yanliang567

@yanliang567 Just checked the last standalone CI run (ms-18745-3). The OS thread count (Go runtime) is 71 at most. More threads are expected in the OpenMP thread pool, but still under control.

congqixia avatar Aug 23 '22 02:08 congqixia

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 22 '22 04:09 stale[bot]