vllm [Feature]: Precise model device placement

🚀 The feature, motivation and pitch

Hi all, I was wondering if it's possible to do precise model device placement. For example, I would like to place the vLLM model on GPU 1 and let GPU 0 do other things. Being able to do precise model device placement will help unblock online RLHF work in our Hugging Face's TRL, because we want to leverage the fast speed of vLLM's generation.

In particular, we'd like to run training on 7 GPUs, and leave only 1 GPU for vLLM inference. I have a very crude hack that supports this at https://github.com/vwxyzjn/vllm/pull/1, but I figure more general support in vLLM will be more helpful.

Currently this is not possible because the following code will error out

from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="gpt2", tensor_parallel_size=1, device="cuda:1")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Alternatives

No response

Additional context

No response

Jul 07 '24 14:07 vwxyzjn

Possibly a stupid question, but have you considered setting CUDA_VISIBLE_DEVICES via shell when running the vLLM script?

Jul 08 '24 05:07 DarkLight1337

That was the first thing I tried. What happens is that the script cannot see the training GPUs

Jul 08 '24 13:07 vwxyzjn

Why does the vLLM inference script need to see the training GPUs?

Jul 08 '24 13:07 DarkLight1337

I would like to run training and inference in the same script, so I can easily load the online trained weights to vLLM more easily (unless there is another way to doing it more elegantly).

import time

import torch
from accelerate import Accelerator
from accelerate.state import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

from vllm import SamplingParams, SingleGPULLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
tok = AutoTokenizer.from_pretrained("vwxyzjn/ppo_zephyr7")
prompt_ids = tok.batch_encode_plus(prompts)["input_ids"]
accelerator = Accelerator(gradient_accumulation_steps=2)
state = PartialState()
llm = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/mistral-7b-sft-beta")
llm = llm.to(accelerator.device)
accelerator.print(f"{torch.cuda.device_count()=}")
if state.is_main_process:
    sampling_params = SamplingParams(temperature=0.001, top_p=1.0)
    inference_llm = SingleGPULLM(model="vwxyzjn/ppo_zephyr7",
                       tensor_parallel_size=1,
                       device="cuda:7")
    llmp = inference_llm.llm_engine.model_executor.driver_worker.model_runner.model
    print(f"🔥🔥🔥 vllm lives in {llmp.lm_head.weight.device}")
    print("prepare to generate")
    outputs = inference_llm.generate(prompt_token_ids=prompt_ids,
                           sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    print("🔥🔥🔥 Loading weights using shared memory;"
          "we expect the generations to be completely different")
    start_time = time.time()
    llmp.load_weights(llm.named_parameters())
    print(f"Time to load weights: {time.time() - start_time:.2f} seconds")
    outputs = inference_llm.generate(prompt_token_ids=prompt_ids,
                           sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
else:
    # llm.forward
    # llm.backward()
    print("I'm waiting for the main process to generate...")
accelerator.wait_for_everyone()

Jul 08 '24 18:07 vwxyzjn

Hmm... @youkaichao any thoughts on this?

Jul 09 '24 01:07 DarkLight1337

it is not possible until we separate driver process and tp rank 0 process. currently they live in the same process as the users' process.

Jul 09 '24 01:07 youkaichao

How about launch a separate vllm server for doing this? It would be much easier and flexible I think.

Jul 09 '24 11:07 DZ9

Conceptually that works! My question in this case is how would you load the weights of the model efficiently. With the current pipeline I have, loading a 7B model takes 0.01 sec (but maybe it’s just because PT is doing async copy)

Jul 09 '24 13:07 vwxyzjn

Very appreciate for you great work! I've been investigated this vllm generation for a while and I'm aware of your concern about efficiency. Right now the only opensourced solution I found for this is implemented in OpenRLHF via broadcasting params to vllm_engine through ray cluster[here]. But running a ray cluster is heavy. Very pleased to find that you are trying to solve it in a more simplicity way. Hope you can finally find a workaround for this! Thanks again for the epic work you've done!

Jul 10 '24 02:07 DZ9

@vwxyzjn Did you find a solution to this? I sent you a twitter DM too, since I would love to do the same for a multi-agent RL pipeline I have going on

Sep 02 '24 13:09 joanvelja

Yes. You can do monkey patch like here https://github.com/allenai/open-instruct/blob/online-trainers/open_instruct/vllm_utils.py. Then you can do stuff like https://github.com/allenai/open-instruct/blob/5641385b1bec87d80b61bb219325be7fecac71c3/open_instruct/online_dpo_vllm.py#L360-L368

Sep 04 '24 15:09 vwxyzjn

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

Dec 04 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

Jan 04 '25 01:01 github-actions[bot]

The above suggestion doesn't work anymore

My brain exploded, but I finally found a way:

from vllm import LLM
import os
from accelerate import Accelerator
from unittest.mock import patch

if __name__=="__main__":
    acc = Accelerator()
    if acc.is_main_process:  # for demo, run only in the main process
        os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # your target device id, here cuda:1
        with patch("torch.distributed.get_world_size", return_value=1):
            llm = LLM(model= "Qwen/Qwen2.5-7B")
    acc.wait_for_everyone()

Jan 27 '25 21:01 qgallouedec

Can you please reopen the issue? The question feels still relevant!

Oct 24 '25 12:10 avidale

@qgallouedec Thank you for posting this solution, which version was this working on? I've tested on 0.10 and it's not working.

Oct 25 '25 22:10 sahandrez

I think that now you can simply use the external launcher. You can check the vLLM doc or the trl repo for an example

Oct 25 '25 22:10 qgallouedec

I think that now you can simply use the external launcher. You can check the vLLM doc or the trl repo for an example

Thanks! I wasn't aware of the external launcher at all. Got it working now!

Oct 25 '25 23:10 sahandrez