vllm [Core] Add MultiprocessingGPUExecutor

This introduces a python multiprocessing-based executor that can be used as an alternative to Ray for single-node inferencing.

With the changes in this PR, Ray will continue to be used for parallel workers if it's installed, otherwise vanilla python multiprocessing is used. It can also be overridden with --no-worker-use-ray.

The existing distributed tests have been updated to run with/without Ray.

Worker processes are shut down when the LLMEngine is garbage collected.

This replaces original PRs #3466 #2898 and #4345. It was originally co-authored by @sahilsuneja1.

May 01 '24 19:05 njhill

This is really great work @njhill. Thanks for all the effort! Will this change also enable ray to become an optional dependency?

May 01 '24 22:05 vrdn-23

I think --no-worker-use-ray is bad. I suggest something like --distributed-executor-backend, which can be either ray or mp , and we might have more in the future.

May 01 '24 22:05 youkaichao

I think --no-worker-use-ray is bad. I suggest something like --distributed-executor-backend, which can be either ray or mp , and we might have more in the future.

@youkaichao that sounds very reasonable, but maybe it could be a separate PR? This is not actually a newly introduced arg - there is already a boolean --worker-use-ray arg, it's just the (argparse-standard) way of specifying a "false" value for that arg.

May 01 '24 22:05 njhill

This is really great work @njhill. Thanks for all the effort! Will this change also enable ray to become an optional dependency?

Yes, although ray is already optional if you are only using single-GPU. I do have some changes to make it an optional "extra" from a python package installation pov but was thinking of a follow-on PR to avoid making this one bigger.

May 01 '24 22:05 njhill

If it is possible, I suggest add --distributed-executor-backend in this pr, and route --worker-use-ray to --distributed-executor-backend=ray . This PR can be enabled under --distributed-executor-backend=multiprocessing .

May 01 '24 22:05 youkaichao

@youkaichao any idea why the ray distributed CI test might be failing now due to a gloo timeout? I think it's something to do with a second engine using TP being created in the same pytest process after the first one is shut down (the test now runs with mp executor followed by ray executor). This wasn't a problem with an earlier version of this PR, but I know you've made changes in this area. I will dig in more but just wanted to check if it's anything obvious to you.

May 02 '24 03:05 njhill

Try to merge the main branch in? I'm not sure, but the latest commit i merge into main can pass the ci test

May 02 '24 03:05 youkaichao

@youkaichao fyi the problem is still there after pulling in your latest fix commit, I'll try to narrow it down tomorrow.

May 02 '24 05:05 njhill

My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.

May 02 '24 05:05 youkaichao

@njhill maybe we should cancel the test for this pr, until you figure it out locally? otherwise the ci will be blocked.

May 02 '24 05:05 youkaichao

Sorry to keep asking, but is there any update on this? @njhill @rkooo567

May 07 '24 22:05 vrdn-23

My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.

@youkaichao I've updated this now to run in separate tests. Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM with tensor parallel, delete it, and then create another one in the same process.

If it is possible, I suggest add --distributed-executor-backend in this pr, and route --worker-use-ray to --distributed-executor-backend=ray . This PR can be enabled under --distributed-executor-backend=multiprocessing .

I've now made this update as requested, can enable with --distributed-executor-backend=mp.

@youkaichao @rkooo567 hopefully this is now ready to merge? :pray: the failing tests look unrelated (same failures on main branch).

We could discuss in a follow-on whether it makes sense to change the default from ray to mp.

May 14 '24 15:05 njhill

I will finish review it by today!

May 14 '24 15:05 rkooo567

Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM with tensor parallel, delete it, and then create another one in the same process.

I have a comment on this: https://github.com/vllm-project/vllm/pull/4508#issuecomment-2087794774

TL;DR is Python garbage collection is unreliable. If we want to address the distributed cleanup issue, we need some user-interface change, like context manager to explicitly control the cleanup.

May 14 '24 17:05 youkaichao

Thanks again @youkaichao @rkooo567 @zhuohan123! And thanks for your patience @vrdn-23!

May 14 '24 17:05 njhill

Hi all, thanks for the efforts, just have one question, is there any performance difference between python multiprocessing and ray? Thanks in advance.

Jun 12 '24 09:06 kerthcet

@kerthcet we found multiprocessing to be faster, but since https://github.com/vllm-project/vllm/pull/4894 the difference probably isn't very significant, especially for larger models.

Jun 12 '24 20:06 njhill

vllm vllm copied to clipboard

[Core] Add MultiprocessingGPUExecutor

vllm
vllm copied to clipboard