
[Core] Pipeline Parallel Support

Open andoorve opened this issue 1 year ago • 38 comments

Adds initial pipeline parallelism support to vLLM.

ToDo:

Milestone 1: POC Prototype

  • [x] Make changes to support multiple schedulers and cache engines in worker.py, llm_engine.py, async_llm_engine.py and block managers.
  • [x] Make changes to running loop to support multiple async steps in flight
  • [x] Make changes to support CUDA graphs for each kv cache.
  • [x] Make changes to LLaMa and GPT models to support send/recving and weight loading
  • [x] Make changes to ray_gpu_executor.py, worker.py and model_runner.py to support multiple driver workers
  • [x] Ensure execution works on 1 node. (May be affected by https://github.com/vllm-project/vllm/issues/4293, https://github.com/vllm-project/vllm/issues/4430, https://github.com/vllm-project/vllm/issues/4135. IMO should not block us)

Milestone 2: Mergeable

  • [x] Fix issues related to LLaMa incorrect outputs (Bug filed against PyTorch https://github.com/pytorch/pytorch/issues/125079 and worked around)
  • [x] Refactor to move sending and recving code out of models.
  • [x] Check if there's a simpler way to do weight loading
  • [x] Enable multi-node
  • [x] Add RFC for community benefit
  • [x] Add some testing
  • [x] Assert out (raise clear errors for) models that are not yet supported, as well as the LLMEngine entry point.
  • [x] Check if any PyNCCL changes are necessary
  • [ ] Rebase on latest
  • [x] Tests passing

FIX #4461

cc: @zhuohan123 @WoosukKwon @simon-mo @youkaichao

andoorve avatar Apr 27 '24 08:04 andoorve

@andoorve - Exciting!!!

robertgshaw2-redhat avatar Apr 27 '24 11:04 robertgshaw2-redhat

@andoorve thanks for the effort! Can you write an RFC to describe the overall design so that people can easily understand it? example rfcs: https://github.com/vllm-project/vllm/issues?q=label%3ARFC+sort%3Aupdated-desc

youkaichao avatar Apr 27 '24 16:04 youkaichao

@youkaichao Yes for sure, it is one of the TODO items above

andoorve avatar Apr 27 '24 17:04 andoorve

Updated the RFC here: https://github.com/vllm-project/vllm/issues/4461 @youkaichao

Let me know if anything needs further elaboration

andoorve avatar Apr 29 '24 23:04 andoorve

FYI pretty sure PyTorch has a bug, filed here: https://github.com/pytorch/pytorch/issues/125079

Worked around this last week by making the sending and receiving phase for each model atomic: the residuals and hidden states are concatenated before the transfer.
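
For illustration only (this is not the PR's actual code; shapes and helper names are assumptions), the idea is to pack the two tensors into one contiguous buffer so the transfer between pipeline stages is a single send/recv:

import torch
import torch.distributed as dist

def send_packed(hidden_states: torch.Tensor, residual: torch.Tensor, dst: int) -> None:
    # Concatenate along the hidden dimension so one tensor crosses the
    # stage boundary instead of two separate sends.
    packed = torch.cat([hidden_states, residual], dim=-1).contiguous()
    dist.send(packed, dst=dst)

def recv_packed(shape, dtype, device, src: int):
    hidden_size = shape[-1]
    packed = torch.empty((*shape[:-1], 2 * hidden_size), dtype=dtype, device=device)
    dist.recv(packed, src=src)
    # Split back into hidden states and residual on the receiving stage.
    hidden_states, residual = packed.split(hidden_size, dim=-1)
    return hidden_states, residual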

andoorve avatar Apr 29 '24 23:04 andoorve

@andoorve hi, I already made the change to pynccl to support multiple groups in https://github.com/vllm-project/vllm/pull/4512 . The first rank can be read from the group argument directly.
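
For reference, a minimal sketch of reading the first rank from a group argument (assuming the group is a torch.distributed ProcessGroup; this is not necessarily the PyNCCL wrapper's actual interface):

import torch.distributed as dist

def first_rank_in_group(group: dist.ProcessGroup) -> int:
    # get_process_group_ranks returns the global ranks that make up the group,
    # in group order; the first entry is the group's "first" (source) rank.
    return dist.get_process_group_ranks(group)[0]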

youkaichao avatar May 01 '24 05:05 youkaichao

Sounds good @youkaichao, I can update mine once that's merged.

Will you also include the change to create the multiple CPU TP groups or should I create a separate PR?

andoorve avatar May 01 '24 05:05 andoorve

Will you also include the change to create the multiple CPU TP groups or should I create a separate PR?

Yes, that's also in my plan. I will break https://github.com/vllm-project/vllm/pull/4460 down into small pieces to be merged, ETA this week.

youkaichao avatar May 01 '24 05:05 youkaichao

Sounds good - I'll revert the PyNCCL changes on this PR and wait for that to be merged to add in

andoorve avatar May 01 '24 05:05 andoorve

Hey @andoorve - This is super exciting!

I'm trying to run a simple example with PP = 2, but encountered an error at runtime. I wrote my own example based on the simple example script examples/offline_inference.py and added pipeline_parallel_size=2 as an argument.

- llm = LLM(model="facebook/opt-125m", load_format="dummy")
+ llm = LLM(model="facebook/opt-2.7b", pipeline_parallel_size=2, load_format="dummy")

This is the error I hit: error.txt. It seems like it's complaining that an item in the kv_caches list was not found (the list is probably empty?)

ERROR 05-01 20:45:18 worker_base.py:147] Traceback (most recent call last):
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/worker/worker_base.py", line 139, in execute_method
ERROR 05-01 20:45:18 worker_base.py:147]     return executor(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147]     return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/worker/worker.py", line 140, in determine_num_available_blocks
ERROR 05-01 20:45:18 worker_base.py:147]     self.model_runner.profile_run()
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147]     return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/worker/model_runner.py", line 844, in profile_run
ERROR 05-01 20:45:18 worker_base.py:147]     self.execute_model(seqs, kv_caches)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147]     return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/worker/model_runner.py", line 763, in execute_model
ERROR 05-01 20:45:18 worker_base.py:147]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/model_executor/models/opt.py", line 300, in forward
ERROR 05-01 20:45:18 worker_base.py:147]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/model_executor/models/opt.py", line 275, in forward
ERROR 05-01 20:45:18 worker_base.py:147]     return self.decoder(input_ids, positions, kv_caches, attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147]     return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147]   File "/workspace/vllm/vllm/model_executor/models/opt.py", line 249, in forward
ERROR 05-01 20:45:18 worker_base.py:147]     hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147] IndexError: list index out of range

I haven't dug into the code deeply enough yet, and I'm curious what the best way is to test and play around with it. If you can point me to a potential starting point, that would be awesome. Thanks!

GindaChen avatar May 01 '24 20:05 GindaChen

Hey @GindaChen there's a couple of things here,

We haven't added support for OPT yet, and the LLMEngine entry point won't work either; we're only supporting AsyncLLMEngine right now.

andoorve avatar May 01 '24 20:05 andoorve

The way I would recommend is to try the online serving entrypoint with the LLaMa model. That'd be the best way to start playing around with it.
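
For example, something along these lines (the model name and flags here are just illustrative):

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf \
--pipeline-parallel-size 2 --enforce-eager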

@GindaChen

andoorve avatar May 01 '24 21:05 andoorve

@andoorve FYI: pynccl with multiple groups is landed at https://github.com/vllm-project/vllm/pull/4512 .

youkaichao avatar May 02 '24 18:05 youkaichao

Will you also include the change to create the multiple CPU TP groups or should I create a separate PR?

@andoorve please check out https://github.com/vllm-project/vllm/pull/4566 and see if you need anything else.

youkaichao avatar May 02 '24 19:05 youkaichao

LGTM - I guess one thing we can add is PP PyNCCL group

andoorve avatar May 02 '24 19:05 andoorve

LGTM - I guess one thing we can add is PP PyNCCL group

That's in my plan. Which operation do you need for pp? allreduce? gather? or anything else?

youkaichao avatar May 02 '24 19:05 youkaichao

We only need point-to-point: blocking send and blocking recv. It's not critical though, unless torch.distributed.* ops don't work well with CUDA graphs.

andoorve avatar May 02 '24 19:05 andoorve

Hi @andoorve,

While benchmarking using your PR, I've consistently encountered engine timeouts with smaller models on setups far below total VRAM capacity, which might relate to the issues you've linked (e.g., [Bug]: Engine iteration timed out #4293, #4430, #4135). I'm using commit https://github.com/vllm-project/vllm/pull/4412/commits/9d698fa4c53491f3b07da6d325c60f856d61c333.

Setup and Reproduction: Models and Hardware:

  • Llama-2-7b-hf on 2x A100s
  • llama-160m on 2x RTX A4000s:
python -m vllm.entrypoints.openai.api_server --model JackFram/llama-160m \
--swap-space 16 \
--disable-log-requests \
--pipeline-parallel-size 2

python benchmarks/benchmark_serving.py --backend vllm --model JackFram/llama-160m \
--dataset-name sharegpt \
--dataset-path /workspace/sharegpt.json \
--num-prompts 3

Observation: The engine hangs almost immediately with 3 running prompts; I see similar issues with larger models at a non-infinite --request-rate.

Proposed Solution:

I traced the issue to asyncio.gather(*coros) in ray_gpu_executor.py returning prematurely because it does not block on ray.ObjectRefs. Inserting ray.wait(coros[1:]) before the gather aligns with the intended code semantics and resolves the hanging.

Branch with fix: https://github.com/SolitaryThinker/vllm/tree/pipeline-parallel-fix

I noticed a new commit from you regarding TP+PP fix, but it didn’t resolve the issue in my environment. Could it be due to missing the latest pynccl changes with groups https://github.com/vllm-project/vllm/pull/4512?

This is my first time handling VLLM and Ray, so any insights or corrections on my understanding or approach would be greatly appreciated.

Additional technical details: After some digging, I realized that asyncio.gather(*coros) is returning before the worker threads have finished. The cause is that coros consists of both futures and ray.ObjectRefs, and asyncio.gather does not appear to block on the latter. Thus, back in run_engine_loop, the VE that is assumed to have finished executing after this call:

 done, _ = await asyncio.wait(requests_in_progress, return_when=asyncio.FIRST_COMPLETED)

could still have workers running when a new engine_step task for the VE is created. I'm not sure of the exact interaction that causes the hanging, but inserting a ray.wait(coros[1:]) before the gather seems to actually respect the intended semantics of the code, i.e., waiting for the ray.ObjectRefs to materialize.
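
A rough sketch of the described workaround (the function and variable names here are illustrative, not the actual ray_gpu_executor.py code):

import asyncio
import ray

async def run_workers_async(coros):
    # coros[0] is the local driver's coroutine; coros[1:] are ray.ObjectRefs
    # returned by worker.execute_method.remote(...). Block until every remote
    # ref has materialized before gathering, so the gather cannot return
    # while remote workers are still executing.
    if len(coros) > 1:
        ray.wait(coros[1:], num_returns=len(coros) - 1)
    return await asyncio.gather(*coros)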

Thanks -will

SolitaryThinker avatar May 06 '24 19:05 SolitaryThinker

@SolitaryThinker

Thanks for the thorough investigation and the fix!

It's indeed true that there are existing issues with hanging on the current vLLM mainline, and I have not rebased on the latest PyNCCL changes yet. I am also unable to reproduce this issue easily with GPT2 in my own testing. For these reasons I haven't investigated as deeply yet. I'll give your setup and fix a try once I check whether multi-node is functional.

I wonder if this is a similar reason to why the TP-only cases are hanging in the issues mentioned above, since there is no such ray.wait in that situation either. In the meantime, @rkooo567, maybe you have some comments?

andoorve avatar May 06 '24 20:05 andoorve

FYI: I recently found that the cleanup logic is prone to hang; this is "fixed" in https://github.com/vllm-project/vllm/pull/4508 .

youkaichao avatar May 06 '24 20:05 youkaichao

@SolitaryThinker I tried the model/commands above that are giving you issues. I was unable to reproduce on my setup.

My Setup

Started a fresh instance with the following:

GCP g2-standard-48 (4 x NVIDIA L4)
Image: Google Deep Learning VM with CUDA 12.1, M120, Debian 11, Python 3.10 (CUDA 12.1 preinstalled)
vLLM installed @ https://github.com/vllm-project/vllm/pull/4412/commits/04b5fe903ac4598b5337d457afd684426e384690

Experiments

Started vLLM with

python -m vllm.entrypoints.openai.api_server --model JackFram/llama-160m \
--swap-space 16 \
--disable-log-requests \
--pipeline-parallel-size 2

Ran the below 3 times:

python benchmarks/benchmark_serving.py --backend vllm --model JackFram/llama-160m \
--dataset-name sharegpt \
--dataset-path ~/sharegpt.json \
--num-prompts 3

Killed the vLLM server, then repeated the above experiment 2 more times, for a total of 3 separate serving instances, 9 benchmark runs, and 27 total requests sent.

Saw the expected benchmark results each time:

Traffic request rate: inf
100%|██████████| 3/3 [00:06<00:00,  2.18s/it]
============ Serving Benchmark Result ============
Successful requests:                     3
Benchmark duration (s):                  6.55
Total input tokens:                      72
Total generated tokens:                  1380
Request throughput (req/s):              0.46
Input token throughput (tok/s):          10.99
Output token throughput (tok/s):         210.70
---------------Time to First Token----------------
Mean TTFT (ms):                          29.55
Median TTFT (ms):                        27.27
P99 TTFT (ms):                           34.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.42
Median TPOT (ms):                        7.77
P99 TPOT (ms):                           7.80

I wonder if it might only be reproducible on other instances... needs further investigation though.

andoorve avatar May 06 '24 22:05 andoorve

A very meaningful feature! Hi @andoorve, I have verified this based on your PR, and the service can now start normally. However, an error occurs when processing requests. My env: 2 nodes of RTX 4090; vLLM installed @ https://github.com/vllm-project/vllm/commit/04b5fe903ac4598b5337d457afd684426e384690

Here is the command:

python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /data/llvm/llama_weight --gpu-memory-utilization 0.60 --pipeline-parallel-size 2 --port 8000 --host 0.0.0.0 --enforce-eager

And here is the error stack:

ERROR 05-07 16:55:03 async_llm_engine.py:43] Engine background task failed
ERROR 05-07 16:55:03 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "python/ray/_raylet.pyx", line 902, in ray._raylet.prepare_args_internal
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 494, in serialize
ERROR 05-07 16:55:03 async_llm_engine.py:43]     return self._serialize_to_msgpack(value)
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 472, in _serialize_to_msgpack
ERROR 05-07 16:55:03 async_llm_engine.py:43]     pickle5_serialized_object = self._serialize_to_pickle5(
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 425, in _serialize_to_pickle5
ERROR 05-07 16:55:03 async_llm_engine.py:43]     raise e
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 420, in _serialize_to_pickle5
ERROR 05-07 16:55:03 async_llm_engine.py:43]     inband = pickle.dumps(
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
ERROR 05-07 16:55:03 async_llm_engine.py:43]     cp.dump(obj)
ERROR 05-07 16:55:03 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
ERROR 05-07 16:55:03 async_llm_engine.py:43]     return Pickler.dump(self, obj)
ERROR 05-07 16:55:03 async_llm_engine.py:43] TypeError: cannot pickle 'torch._C.Generator' object

zhengxingmao avatar May 07 '24 09:05 zhengxingmao

@zhengxingmao Thanks for reporting this! Does this happen without PP? If not, I think it could be some interaction between PP and the following flags: --trust-remote-code --model /data/llvm/llama_weight --gpu-memory-utilization 0.60

Can you try without these flags and use a model directly from HF? (LLaMa)

andoorve avatar May 07 '24 17:05 andoorve

@SolitaryThinker

I did some investigation into what you were saying. I think there are real hangs that appear. I tried LLaMa 3 8B with effectively infinite request rate on 2 L4s and saw hangs - not sure if this is the same situation that you found yourself in. Strangely, if I did a warm up request first, the hang went away.

The ray.wait solution doesn't help, and it's not intended for async contexts. See here https://docs.ray.io/en/latest/ray-core/api/doc/ray.wait.html:

This method will issue a warning if it’s running inside an async context. Instead of ray.wait(ray_waitables), you can use await asyncio.wait(ray_waitables).

Also from here, asyncio methods such as asyncio.wait and asyncio.gather should be sufficient: https://docs.ray.io/en/latest/ray-core/actors/async_api.html
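
A minimal, self-contained sketch of the asyncio-native pattern the Ray docs recommend (illustrative only; the actor and method names here are made up, not vLLM's):

import asyncio
import ray

@ray.remote
class Worker:
    def execute_model(self) -> str:
        return "ok"

async def drive(workers):
    # Ray ObjectRefs are awaitable, so asyncio.gather blocks until every
    # remote result has materialized; no ray.wait() is needed in async code.
    refs = [w.execute_model.remote() for w in workers]
    return await asyncio.gather(*refs)

ray.init()
print(asyncio.run(drive([Worker.remote() for _ in range(2)])))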

I resolved a hang on my end with: https://github.com/vllm-project/vllm/pull/4412/commits/df9b0c45cee14395b2b2dff9c4e3343ab2a019a1

Maybe this helps for you?

andoorve avatar May 07 '24 18:05 andoorve

@andoorve Thank you for your suggestion. After removing the --trust-remote-code flag, the service now works normally. Today I synced the latest code and found that there are limitations on the model types when PP is enabled: I tried Qwen2 and it reported an error saying that this model is not supported. After adding it to the list of supported models and fixing an out-of-bounds issue with the kv_caches[i] index,

--- a/vllm/model_executor/models/qwen2.py
+++ b/vllm/model_executor/models/qwen2.py
@@ -248,7 +248,8 @@ class Qwen2Model(nn.Module):
     ) -> torch.Tensor:
         hidden_states = self.embed_tokens(input_ids)
         residual = None
-        for i in range(len(self.layers)):
+        #for i in range(len(self.layers)):
+        for i, (layer, kv_cache) in enumerate(zip(self.layers, kv_caches)): ---->changed here 
             layer = self.layers[i]
             hidden_states, residual = layer(
                 positions,

I found that the model can run, but the returned results are incorrect and don’t make sense. However, when PP=1, it runs normally. So, are there any special requirements for enabling PP>1 for this model?

Is it related to this issue: https://github.com/pytorch/pytorch/issues/125079 ?

zhengxingmao avatar May 08 '24 02:05 zhengxingmao

Hi @zhengxingmao, thanks for trying it out. If --trust-remote-code does not work even on the main branch, can you file a bug against the repo? I'm not sure if that's specific to this branch.

As for Qwen2, there is no special reason we do not support it right now. We just decided to support LLaMa and GPT2 for this PR, as they are good examples for the rest of the models.

This PR is just intended to enable PP generally, and enable future PRs for specific models such as Qwen2 which can be added as necessary. Model support will be very similar to what was done for LLaMa and GPT2 here.

From what I can tell, you may not have enabled it correctly for Qwen2 in your local changes. It would be useful to check that your changes closely follow what was done for LLaMa here: https://github.com/vllm-project/vllm/blob/7e993601f47e68afe31b30ac66f9252956ce58c9/vllm/model_executor/models/llama.py.
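
For illustration, a toy sketch of the indexing pattern involved (names like start_layer/end_layer are assumptions for this sketch, not the PR's exact API; see the llama.py link above for the real implementation):

def run_stage(hidden_states, layers, kv_caches, start_layer, end_layer):
    # This pipeline stage executes only its local slice of decoder layers,
    # [start_layer, end_layer). kv_caches holds caches for those local layers
    # only, so it must be indexed relative to start_layer; indexing it with
    # the global layer index is what produces "list index out of range".
    for i in range(start_layer, end_layer):
        local_kv_cache = kv_caches[i - start_layer]
        hidden_states = layers[i](hidden_states, local_kv_cache)
    return hidden_states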

The issue above (https://github.com/pytorch/pytorch/issues/125079) isn't related to this, and has been worked around already.

andoorve avatar May 08 '24 02:05 andoorve

If it is related to --trust-remote-code, then maybe this is already fixed in main by https://github.com/vllm-project/vllm/pull/4286 .

youkaichao avatar May 08 '24 02:05 youkaichao

As for Qwen2, there is no special reason we do not support it right now. We just decided to support LLaMa and GPT2 for this PR, as they are good examples for the rest of the models.

Qwen2 is also a very popular model at the moment, and I hope that support for it can be considered. However, I will first try to support it according to the suggestions you have given.

zhengxingmao avatar May 08 '24 03:05 zhengxingmao

If it is related to --trust-remote-code, then maybe this is already fixed in main by #4286 .

Running with PP=1 and the --trust-remote-code option works normally.

zhengxingmao avatar May 08 '24 03:05 zhengxingmao

Works OK for me with: python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --port 8092 --enable-chunked-prefill --enforce-eager --pipeline-parallel-size 2 --trust-remote-code

Qwen2 is also a very popular model at the moment, and I hope that support for it can be considered. However, I will first try to support it according to the suggestions you have given.

Yup, eventually all of the models will have pipeline parallel support.

andoorve avatar May 08 '24 04:05 andoorve