[Question]: Standalone launches 800+ threads, is there a thread leak?
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Environment
- Milvus version: master (commit 9621b661)
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory:
- GPU:
- Others:
### Current Behavior
The DevOps team found that there are 7k+ threads in one query node pod.
Is there a thread leak in Milvus?
### Expected Behavior
If there is a thread leak, we should fix it. If this is expected behavior, an explanation would be appreciated.
### Steps To Reproduce
Tested on my Ubuntu 18.04 machine:
1. Launch standalone; there are only 33 threads in the process:
```
❯ ps -ef | grep milvus
496:caiyd 16695 1 8 21:15 pts/5 00:00:03 ./bin/milvus run standalone
500:caiyd 17615 20752 0 21:16 pts/6 00:00:00 grep -nE --color=auto milvus
❯ cat /proc/16695/status | grep Thread
35:Threads: 33
```
2. Run pytest and observe that the thread number keeps increasing before stopping at 805:
```
❯ pytest -n 8 ./testcases/test_search.py
... ...
❯ cat /proc/16695/status | grep Thread
35:Threads: 225
❯ cat /proc/16695/status | grep Thread
35:Threads: 236
❯ cat /proc/16695/status | grep Thread
35:Threads: 258
❯ cat /proc/16695/status | grep Thread
35:Threads: 726
❯ cat /proc/16695/status | grep Thread
35:Threads: 726
❯ cat /proc/16695/status | grep Thread
35:Threads: 805
❯ cat /proc/16695/status | grep Thread
35:Threads: 805
```
3. Wait for pytest to finish; the thread number stays at 805:
```
-- Docs: https://docs.pytest.org/en/stable/warnings.html
--------------------- generated html file: file:///tmp/ci_logs/report.html ---------------------
Results (1625.32s):
2716 passed
42 xpassed
40 xfailed
62 skipped
[get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs/
[create_path] folder(/tmp/ci_logs/) is not exist.
[create_path] create path now...
❯ cat /proc/16695/status | grep Thread
35:Threads: 805
```
4. Repeat steps 2-3; the thread number increases to 827:
```
❯ cat /proc/16695/status | grep Thread
35:Threads: 805
❯ cat /proc/16695/status | grep Thread
35:Threads: 827
❯ cat /proc/16695/status | grep Thread
35:Threads: 827
```
### Milvus Log
_No response_
### Anything else?
_No response_
/assign @congqixia
After discussion with @longjiquan, the preliminary conclusion is that two parts of the implementation cause this symptom:
- When doing `segment.Search` in QueryNode, Milvus dispatches a new goroutine for each cgo call. This drives the thread number up to the maximum number of segments searched concurrently (see the sketch after this list).
- In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there is no way to recycle them.
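To make the first point concrete, here is a minimal standalone Go sketch, not Milvus code: `blocking_call` is just a hypothetical stand-in for a long-running segcore search invoked through cgo. While a goroutine sits inside a cgo call it pins an OS thread, and the Go runtime creates additional threads to keep other goroutines running, so the OS thread count grows roughly with the number of concurrent calls:

```go
package main

/*
#include <unistd.h>
// Hypothetical stand-in for a blocking segcore search reached via cgo.
static void blocking_call(void) { sleep(2); }
*/
import "C"

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	// Pretend 200 segments are searched concurrently, one goroutine per cgo call.
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			C.blocking_call() // each in-flight cgo call occupies one OS thread
		}()
	}
	time.Sleep(time.Second)
	// The threadcreate profile counts OS threads created by the runtime so far.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```

The Go runtime generally does not destroy these threads once they exist, which would explain why the count plateaus at 805 instead of dropping back after pytest finishes.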
Here are some suggestions we could try:
- Add a cgo worker pool sized to CPU number * [fixed ratio] (this ratio should be derived from performance testing); see the sketch after this list
- Control the concurrency in segcore/knowhere and add a mechanism to recycle extra threads if needed
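A minimal sketch of the first suggestion, assuming a hypothetical `cgopool` package and a placeholder ratio of 2 (neither is actual Milvus code): a fixed set of worker goroutines executes every cgo call, so at most `NumCPU * ratio` OS threads can be blocked inside segcore at any time.

```go
package cgopool

import "runtime"

// ratio is a placeholder; per the suggestion it should come from performance tests.
const ratio = 2

type task struct {
	fn   func()
	done chan struct{}
}

var tasks = make(chan task)

func init() {
	// Fixed-size pool: only this many goroutines (and hence blocked OS threads)
	// can ever be inside a cgo call at once.
	for i := 0; i < runtime.NumCPU()*ratio; i++ {
		go func() {
			for t := range tasks {
				t.fn() // the blocking cgo call runs on a pool worker
				close(t.done)
			}
		}()
	}
}

// Submit blocks until fn (the cgo call) has been executed by a worker.
func Submit(fn func()) {
	done := make(chan struct{})
	tasks <- task{fn: fn, done: done}
	<-done
}
```

Callers would then wrap each `segment.Search` cgo call in `cgopool.Submit` instead of spawning a fresh goroutine per call.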
@czs007 @cydrain @longjiquan @xiaofan-luan What do you think?
> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there is no way to recycle them.

Is that true? Does that make sense? @cydrain
- As a quick fix, does it help if we remove segment-level parallelization?

In the future:
- We should only support request-level parallelism (controlled by the segcore scheduler) and data-level parallelism (controlled by knowhere)
- Should we remove OMP and switch to a global thread pool to avoid repeated thread creation?
From the Knowhere side, we can consider force-setting the OMP thread number in the Wrapper layer to ensure consistent performance across different kinds of indexes. This number can be controlled by segcore.
Submitted a quick fix to limit cgo call concurrency to GOMAXPROCS (CPU number): #18461
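Roughly the shape of such a limit, as an illustration only (the package and helper names below are hypothetical, not code from #18461): a weighted semaphore sized to `runtime.GOMAXPROCS(0)` gates every blocking cgo call.

```go
package cgolimit

import (
	"context"
	"runtime"

	"golang.org/x/sync/semaphore"
)

// At most GOMAXPROCS cgo calls may be in flight at once.
var cgoSem = semaphore.NewWeighted(int64(runtime.GOMAXPROCS(0)))

// withCgoLimit acquires a slot before running the blocking cgo call
// and releases it when the call returns.
func withCgoLimit(ctx context.Context, call func() error) error {
	if err := cgoSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer cgoSem.Release(1)
	return call()
}
```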
> In segcore/knowhere, OpenMP creates threads independently for each cgo call. Even though these threads could be reused, there is no way to recycle them.
>
> Is that true? Does that make sense? @cydrain
Faiss uses lots of `omp parallel` in its code. Threads created by OMP are destroyed after execution, so the threads left in Milvus must not be caused by knowhere.
@congqixia was this fixed?
@yanliang567 Just checked the last standalone CI run (ms-18745-3). The OS thread number (go runtime) is 71 at most. More threads are expected in the OpenMP thread pool, but they are still under control.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.