vllm
vllm copied to clipboard
[Core] Add MultiprocessingGPUExecutor
This introduces a python multiprocessing-based executor that can be used as an alternative to Ray for single-node inferencing.
With the changes in this PR, Ray will continue to be used for parallel workers if it's installed, otherwise vanilla python multiprocessing is used. It can also be overridden with --no-worker-use-ray
.
The existing distributed tests have been updated to run with/without Ray.
Worker processes are shut down when the LLMEngine
is garbage collected.
This replaces original PRs #3466 #2898 and #4345. It was originally co-authored by @sahilsuneja1.
This is really great work @njhill. Thanks for all the effort! Will this change also enable ray to become an optional dependency?
I think --no-worker-use-ray
is bad. I suggest something like --distributed-executor-backend
, which can be either ray
or mp
, and we might have more in the future.
I think
--no-worker-use-ray
is bad. I suggest something like--distributed-executor-backend
, which can be eitherray
ormp
, and we might have more in the future.
@youkaichao that sounds very reasonable, but maybe it could be a separate PR? This is not actually a newly introduced arg - there is already a boolean --worker-use-ray
arg, it's just the (argparse-standard) way of specifying a "false" value for that arg.
This is really great work @njhill. Thanks for all the effort! Will this change also enable ray to become an optional dependency?
Yes, although ray is already optional if you are only using single-GPU. I do have some changes to make it an optional "extra" from a python package installation pov but was thinking of a follow-on PR to avoid making this one bigger.
If it is possible, I suggest add --distributed-executor-backend
in this pr, and route --worker-use-ray
to --distributed-executor-backend=ray
. This PR can be enabled under --distributed-executor-backend=multiprocessing
.
@youkaichao any idea why the ray distributed CI test might be failing now due to a gloo timeout? I think it's something to do with a second engine using TP being created in the same pytest process after the first one is shut down (the test now runs with mp executor followed by ray executor). This wasn't a problem with an earlier version of this PR, but I know you've made changes in this area. I will dig in more but just wanted to check if it's anything obvious to you.
Try to merge the main branch in? I'm not sure, but the latest commit i merge into main can pass the ci test
@youkaichao fyi the problem is still there after pulling in your latest fix commit, I'll try to narrow it down tomorrow.
My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.
@njhill maybe we should cancel the test for this pr, until you figure it out locally? otherwise the ci will be blocked.
Sorry to keep asking, but is there any update on this? @njhill @rkooo567
My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.
@youkaichao I've updated this now to run in separate tests. Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM
with tensor parallel, delete it, and then create another one in the same process.
If it is possible, I suggest add
--distributed-executor-backend
in this pr, and route--worker-use-ray
to--distributed-executor-backend=ray
. This PR can be enabled under--distributed-executor-backend=multiprocessing
.
I've now made this update as requested, can enable with --distributed-executor-backend=mp
.
@youkaichao @rkooo567 hopefully this is now ready to merge? :pray: the failing tests look unrelated (same failures on main branch).
We could discuss in a follow-on whether it makes sense to change the default from ray
to mp
.
I will finish review it by today!
Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM with tensor parallel, delete it, and then create another one in the same process.
I have a comment on this: https://github.com/vllm-project/vllm/pull/4508#issuecomment-2087794774
TL;DR is Python garbage collection is unreliable. If we want to address the distributed cleanup issue, we need some user-interface change, like context manager to explicitly control the cleanup.
Thanks again @youkaichao @rkooo567 @zhuohan123! And thanks for your patience @vrdn-23!
Hi all, thanks for the efforts, just have one question, is there any performance difference between python multiprocessing and ray? Thanks in advance.
@kerthcet we found multiprocessing to be faster, but since https://github.com/vllm-project/vllm/pull/4894 the difference probably isn't very significant, especially for larger models.