
[Question]: Standalone launches 800+ threads, is there a thread leak?

Open cydrain opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master (commit 9621b661)
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The DevOps team found 7k+ threads in one query node pod.

Is there a thread leak in Milvus?

Expected Behavior

If there is a thread leak, we should fix it. If it's expected behavior, please provide an explanation.

Steps To Reproduce

Tested on my Ubuntu 18.04 machine:

1. Launch standalone; there are only 33 threads in this process:

❯ ps -ef | grep milvus
496:caiyd    16695     1  8 21:15 pts/5    00:00:03 ./bin/milvus run standalone
500:caiyd    17615 20752  0 21:16 pts/6    00:00:00 grep -nE --color=auto milvus
❯ cat /proc/16695/status | grep Thread
35:Threads:	33
2. Run pytest and observe that the thread count keeps increasing, stopping at 805:
❯ pytest -n 8 ./testcases/test_search.py   
... ...
❯ cat /proc/16695/status | grep Thread
35:Threads:	225
❯ cat /proc/16695/status | grep Thread
35:Threads:	236
❯ cat /proc/16695/status | grep Thread
35:Threads:	258
❯ cat /proc/16695/status | grep Thread
35:Threads:	726
❯ cat /proc/16695/status | grep Thread
35:Threads:	726
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
3. Wait for pytest to finish; the thread count stays at 805:
-- Docs: https://docs.pytest.org/en/stable/warnings.html
generated html file: file:///tmp/ci_logs/report.html

Results (1625.32s):
    2716 passed
      42 xpassed
      40 xfailed
      62 skipped
[get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs/
[create_path] folder(/tmp/ci_logs/) is not exist.
[create_path] create path now...
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
4. Repeat steps 2~3; the thread count increases to 827:
❯ cat /proc/16695/status | grep Thread
35:Threads:	805
❯ cat /proc/16695/status | grep Thread
35:Threads:	827
❯ cat /proc/16695/status | grep Thread
35:Threads:	827


### Milvus Log

_No response_

### Anything else?

_No response_

cydrain avatar Jul 29 '22 02:07 cydrain

/assign @congqixia

cydrain avatar Jul 29 '22 02:07 cydrain

After discussion with @longjiquan, the preliminary conclusion is that two parts of the implementation cause this symptom:

  • When doing segment.Search in QueryNode, Milvus dispatches a new goroutine for each cgo call. This causes the thread count to grow up to the maximum number of segments searched concurrently (see the sketch after this list).
  • In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.
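
For illustration, here is a minimal, self-contained Go sketch (not Milvus code; the C `blocking_call` below is a stand-in for a segcore search reached via cgo) showing why one goroutine per blocking cgo call inflates the OS thread count: each call pins its thread, so the Go runtime keeps spawning new ones.

```go
package main

/*
#include <unistd.h>
// Stand-in for a blocking segcore search reached via cgo.
static void blocking_call() { sleep(2); }
*/
import "C"

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	// One goroutine per "segment": every blocking cgo call pins an OS thread,
	// so the runtime must create a new thread for each concurrent call.
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			C.blocking_call()
		}()
	}
	time.Sleep(time.Second) // let the calls pile up
	fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```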

Here are some suggestions we could try:

  • Add a cgo worker pool sized at CPU count * a fixed ratio (the ratio should be derived from performance testing); a sketch follows this list.
  • Control the concurrency in segcore/knowhere and add a mechanism to recycle extra threads if needed.
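
A minimal sketch of the worker-pool idea, assuming a semaphore sized at GOMAXPROCS * ratio; names like `cgoPool` are illustrative, not the actual Milvus implementation:

```go
package main

import (
	"runtime"
	"sync"
)

// cgoPool caps how many goroutines may enter cgo at once, so the number of
// OS threads pinned by blocking segcore calls stays bounded.
type cgoPool struct {
	sem chan struct{}
}

func newCgoPool(ratio int) *cgoPool {
	return &cgoPool{sem: make(chan struct{}, runtime.GOMAXPROCS(0)*ratio)}
}

// Do runs fn (a wrapper around a blocking cgo call) once a slot is free.
func (p *cgoPool) Do(fn func()) {
	p.sem <- struct{}{}        // acquire a slot
	defer func() { <-p.sem }() // release it when the cgo call returns
	fn()
}

func main() {
	pool := newCgoPool(1) // ratio of 1 => at most GOMAXPROCS concurrent cgo calls
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			pool.Do(func() { /* e.g. the cgo search on one segment */ })
		}()
	}
	wg.Wait()
}
```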

@czs007 @cydrain @longjiquan @xiaofan-luan WTYGT?

congqixia avatar Jul 29 '22 04:07 congqixia

> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.

Is that true? Does that make sense? @cydrain

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

  1. For a quick fix, if we remove segment-level parallelization, does it help?

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

In the future:

  1. We only support request-level parallelism (controlled by the segcore scheduler) and data-level parallelism (controlled by knowhere).
  2. Should we remove OMP and change to a global thread pool to avoid repeated thread creation?

xiaofan-luan avatar Jul 29 '22 05:07 xiaofan-luan

From the Knowhere side, we can consider force-setting the OMP thread count in the wrapper layer to ensure consistent performance across different kinds of indexes. This count can be controlled by segcore.
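
As a hedged illustration of that idea (not Knowhere's actual wrapper API), a caller-side Go/cgo sketch that caps the OMP thread count before any parallel region runs:

```go
package main

/*
#cgo CFLAGS: -fopenmp
#cgo LDFLAGS: -fopenmp
#include <omp.h>

// Illustrative wrapper: the caller (segcore, in this proposal) fixes the OMP
// thread budget before any index code opens a parallel region.
static void set_omp_threads(int n) { omp_set_num_threads(n); }

static int parallel_width(void) {
    int threads = 0;
    #pragma omp parallel
    {
        #pragma omp master
        threads = omp_get_num_threads();
    }
    return threads;
}
*/
import "C"

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap OMP parallelism at the CPU count visible to the Go runtime.
	C.set_omp_threads(C.int(runtime.GOMAXPROCS(0)))
	fmt.Println("OMP threads per parallel region:", int(C.parallel_width()))
}
```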

liliu-z avatar Jul 29 '22 06:07 liliu-z

Submitted a quick fix to limit cgo call concurrency with GOMAXPROCS (CPU count): #18461

congqixia avatar Jul 29 '22 08:07 congqixia

> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there seems to be no way to recycle them.
>
> Is that true? Does that make sense? @cydrain

Faiss uses lots of "omp parallel" in its code. Threads created by OMP are destroyed after execution, so the leftover threads in Milvus cannot be caused by knowhere.

cydrain avatar Jul 30 '22 03:07 cydrain

@congqixia was this fixed?

yanliang567 avatar Aug 23 '22 01:08 yanliang567

@yanliang567 Just checked the last standalone CI run (ms-18745-3). The OS thread count (Go runtime) is 71 at most. More threads are expected in the OpenMP thread pool, but still under control.

congqixia avatar Aug 23 '22 02:08 congqixia

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 22 '22 04:09 stale[bot]