[Usage]: How to use multi-instance in vLLM? (Model replication on multiple GPUs)
I would like to use techniques such as the multi-instance support provided by the tensorrt-llm backend. In its documentation, I can see that multiple models are served using modes like Leader mode and Orchestrator mode. Does vLLM support this functionality natively? Or should I implement it myself, similarly to the tensorrt-llm backend?
Reference URL: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#leader-mode
It works fine with the online mode - you just create multiple servers (even reusing the same GPUs!), but indeed it doesn't work with the offline mode. Here is an example on an 8x H100 node:
from vllm import LLM, SamplingParams
import multiprocessing


def main():
    llm1 = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.65,
    )
    llm2 = LLM(
        model="microsoft/phi-1_5",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.25,
    )

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm1.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    outputs = llm2.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()
and then:
VLLM_WORKER_MULTIPROC_METHOD=spawn python offline-2models-2.py
and it hangs while initializing the 2nd model:
INFO 08-01 01:10:37 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/phi-1_5', speculative_config=None, tokenizer='microsoft/phi-1_5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/phi-1_5, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=3445429) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445426) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445431) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445427) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445432) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445428) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445430) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
The problem seems to be in some internal state that is not being isolated, even if I do:
llm1 = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.65,
)
del llm1
llm2 = LLM(
    model="microsoft/phi-1_5",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.25,
)
it still hangs in the init of the second model. While this del would be impractical for what we are trying to do, it demonstrates that vLLM can't handle multiple models in offline mode, which is a pity.
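One workaround that should sidestep the shared-state problem (a sketch of my own, not something verified in this thread) is to give each offline LLM its own spawned process, mirroring what already works when running two online servers on the same GPUs:

# Sketch: isolate each LLM engine in its own process so no internal state is shared.
# Model names and memory fractions are the ones from the example above.
import multiprocessing as mp

PROMPTS = ["Hello, my name is", "The president of the United States is"]


def run_model(model: str, mem_frac: float):
    # Construct the engine only inside the child process.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=8, gpu_memory_utilization=mem_frac)
    for out in llm.generate(PROMPTS, SamplingParams(temperature=0.8, top_p=0.95)):
        print(f"[{model}] {out.prompt!r} -> {out.outputs[0].text!r}")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # don't fork a parent that may have touched CUDA
    for model, frac in [("meta-llama/Meta-Llama-3-8B-Instruct", 0.65),
                        ("microsoft/phi-1_5", 0.25)]:
        p = mp.Process(target=run_model, args=(model, frac))
        p.start()
        p.join()  # sequential here; starting both before joining would run them concurrently

Whether the two engines can also run concurrently this way on the same eight GPUs is less certain; at a minimum their gpu_memory_utilization fractions have to fit on the devices together.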
@stas00 I have been debugging at least the latter case and will open a fix today. I can check whether it also works with concurrent LLMs, but I expect there may be additional isolation changes needed for that.
Thanks a lot for working on that, @njhill - that will help with disaggregation-type offline use of vLLM.
@stas00 I wonder if it's possible to create multiple servers on the same GPU if GPU memory is not an issue?
With the online setup, yes, that would work, but this is an offline recipe.
please read https://github.com/vllm-project/vllm/issues/6155#issuecomment-2261755228
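For reference, that online recipe boils down to launching separate server processes whose memory fractions fit together on one device. A minimal sketch, with assumed ports, models, and fractions:

# Sketch only: two API servers sharing GPU 0 by splitting the memory budget.
# Ports, models, and --gpu-memory-utilization values are illustrative assumptions.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
cmds = [
    ["python", "-m", "vllm.entrypoints.openai.api_server",
     "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
     "--port", "8000", "--gpu-memory-utilization", "0.6"],
    ["python", "-m", "vllm.entrypoints.openai.api_server",
     "--model", "microsoft/phi-1_5",
     "--port", "8001", "--gpu-memory-utilization", "0.3"],
]
procs = [subprocess.Popen(cmd, env=env) for cmd in cmds]
for p in procs:
    p.wait()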
@stas00 did the patch from njhill fix the issue you raised?
@njhill hasn't posted an update since his last note https://github.com/vllm-project/vllm/issues/6155#issuecomment-2261779014, so I wasn't able to validate it. Do you have the PR link? I'd be happy to re-test.
@russellb @stas00 sorry, yes, the fixes I was referring to above were done in https://github.com/vllm-project/vllm/pull/8492. Hopefully your example above will work cleanly now.
Thank you for the update, @njhill - there has been progress: it no longer hangs in the init of the second model, but now it hangs here:
INFO 10-17 22:54:59 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='microsoft/phi-1_5', speculative_config=None, tokenizer='microsoft/phi-1_5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=microsoft/phi-1_5, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-17 22:55:02 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
(VllmWorkerProcess pid=2977383) INFO 10-17 22:55:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
Thread dump of the hung process:
Thread 2977044 (idle): "MainThread"
wait (threading.py:320)
wait (threading.py:607)
get (vllm/executor/multiproc_worker_utils.py:51)
<listcomp> (vllm/executor/multiproc_gpu_executor.py:196)
_run_workers (vllm/executor/multiproc_gpu_executor.py:196)
_init_executor (vllm/executor/multiproc_gpu_executor.py:110)
__init__ (vllm/executor/executor_base.py:47)
__init__ (vllm/executor/distributed_gpu_executor.py:26)
__init__ (vllm/engine/llm_engine.py:334)
from_engine_args (vllm/engine/llm_engine.py:573)
__init__ (vllm/entrypoints/llm.py:177)
main (offline-2models-2.py:30)
<module> (offline-2models-2.py:46)
Thread 2977298 (idle): "Thread-3"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 2977356 (idle): "Thread-4"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 2977384 (idle): "Thread-5"
_recv (multiprocessing/connection.py:379)
_recv_bytes (multiprocessing/connection.py:414)
recv_bytes (multiprocessing/connection.py:216)
get (multiprocessing/queues.py:103)
run (vllm/executor/multiproc_worker_utils.py:80)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 2977385 (idle): "Thread-6"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
run (vllm/executor/multiproc_worker_utils.py:106)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 2977386 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
I'm running this example https://github.com/vllm-project/vllm/issues/6155#issuecomment-2261755228 (same problem w/ or w/o del llm1 before creating llm2).
This is with v0.6.3.post1.
:cry: thanks @stas00, we can keep this open and I'll try to get to it soon.
Thank you, @njhill - my example can serve as a repro.
And of course we want more than one LLM object to co-exist, not just run them sequentially (i.e. without del llm1 before instantiating llm2) - this opens up some interesting workflows where multiple models are used at once from the same program.
For example, a small model could be used to augment the prompt, and then a larger model could do the normal generation using the extended prompt, as sketched below.
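To make that concrete, here is a sketch of such a workflow (my own illustration, with assumed models, prompts, and memory fractions). Since two engines cannot yet co-exist in one offline process, it runs them one after the other and tears the first one down in between, which relies on the sequential cleanup path this issue tracks:

# Sketch of the prompt-augmentation workflow: a small model expands the prompt,
# then a larger model generates from the expanded prompt. Models are assumptions.
import gc

import torch
from vllm import LLM, SamplingParams

prompts = ["Explain why the sky is blue."]

# Stage 1: small model adds detail to each prompt.
small = LLM(model="microsoft/phi-1_5", gpu_memory_utilization=0.3)
outs = small.generate(
    [f"Add helpful detail to this prompt: {p}" for p in prompts],
    SamplingParams(temperature=0.7, max_tokens=64),
)
augmented = [p + " " + o.outputs[0].text for p, o in zip(prompts, outs)]

# Tear down the first engine before creating the second one
# (the sequential case addressed by https://github.com/vllm-project/vllm/pull/8492).
del small
gc.collect()
torch.cuda.empty_cache()

# Stage 2: larger model does the actual generation from the augmented prompts.
big = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.8)
for out in big.generate(augmented, SamplingParams(temperature=0.8, top_p=0.95)):
    print(f"Prompt: {out.prompt!r}\nGenerated: {out.outputs[0].text!r}")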
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
+1 to this issue
+1 to this issue
+1 to this issue
+1 to this issue
Are there any updates?
Any update?
+1. Any update on how to run multiple instances of vLLM on a node with multiple GPUs?
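In the meantime, one pattern that can stand in for multi-instance replication today (a sketch of my own, not a built-in vLLM mode; model name, GPU count, and ports are assumptions) is to start one OpenAI-compatible server per GPU and load-balance across them:

# Sketch only: one vLLM API server per GPU, each pinned via CUDA_VISIBLE_DEVICES.
# The model, GPU count, and ports are illustrative assumptions.
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
NUM_GPUS = 2  # assumption: two replicas on two GPUs

procs = []
for gpu_id in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL,
         "--port", str(8000 + gpu_id)],
        env=env,
    ))

# Requests then need to be spread across ports 8000..8001, e.g. by nginx or a
# simple round-robin client; vLLM itself does not provide the load balancer.
for p in procs:
    p.wait()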