
How to initialize two LLMs in one service?

David-Lee-1990 opened this issue on Jul 25, 2023 · 6 comments

When I initialize two models in one service, it raises an assertion error: "AssertionError: data parallel group is already initialized".

How can I solve this?
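For reference, what I am doing is roughly the following (the model names are just placeholders, not my actual models):

from vllm import LLM

llm_a = LLM(model="facebook/opt-125m")
llm_b = LLM(model="facebook/opt-350m")  # second init fails: "data parallel group is already initialized"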

David-Lee-1990 — Jul 25 '23

@zhuohan123 any updates on this?

federicotorrielli — Sep 15 '23

@federicotorrielli I had the same problem (though I did not need to initialize both models in parallel, only sequentially).

A workaround that works for me:

from vllm import LLM, SamplingParams
import gc
import torch
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the first model and run generation (model_a, prompts and sampling_params
# are defined elsewhere).
llm = LLM(model=model_a, dtype='bfloat16')
outputs = llm.generate(prompts, sampling_params)

# Tear down the first engine and free its GPU memory before loading the second model.
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()

# The second model can now be initialized in the same process.
llm = LLM(model=model_b, dtype='bfloat16')
outputs = llm.generate(prompts, sampling_params)

For parallel serving I would try Docker containers (it may also work with multiprocessing/subprocess, but I have not tried that).
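A rough sketch of the multiprocessing idea (I have not verified this; the model names, GPU ids and prompts are placeholders):

import multiprocessing as mp
import os

def serve(model_name, gpu_id, prompts):
    # Pin each child process to its own GPU before importing vLLM / initializing CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams
    llm = LLM(model=model_name, dtype='bfloat16')
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    print(model_name, outputs[0].outputs[0].text)

if __name__ == "__main__":
    # "spawn" gives each child its own fresh CUDA context, so each one can create its own engine.
    mp.set_start_method("spawn")
    procs = [
        mp.Process(target=serve, args=("facebook/opt-125m", 0, ["Hello"])),
        mp.Process(target=serve, args=("facebook/opt-350m", 1, ["Hello"])),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()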

jphme — Sep 19 '23

This is perfect! Thanks :)

federicotorrielli — Sep 19 '23


@jphme @federicotorrielli

In my case I needed to add an extra torch.distributed.destroy_process_group() call to actually free the memory and reload the model.

P.S. Note that I have 4 GPUs available for distributed computation.

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the first model.
llm = LLM(model="facebook/opt-125m")
# outputs = llm.generate(prompts, sampling_params)

# Tear down the engine, free the CUDA cache, and destroy the torch.distributed process group.
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()

# Reload a model in the same process (here simply the same one again).
llm = LLM(model="facebook/opt-125m")
# outputs = llm.generate(prompts, sampling_params)
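To actually run the commented-out generate calls, something like the following works (the prompts and sampling parameters are just placeholders):

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)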

Best regards,

Shuyue Nov 25th, 2023

SuperBruceJia — Nov 25 '23

If you just want to run two small models without any tensor parallelism or pipeline parallelism, try this patch to skip the initialization of the distributed group:

diff --git a/vllm/model_executor/parallel_utils/parallel_state.py b/vllm/model_executor/parallel_utils/parallel_state.py
index 9a5e288..8fecd11 100644
--- a/vllm/model_executor/parallel_utils/parallel_state.py
+++ b/vllm/model_executor/parallel_utils/parallel_state.py
@@ -105,8 +105,11 @@ def get_pipeline_model_parallel_group():
 
 def get_tensor_model_parallel_world_size():
     """Return world size for the tensor model parallel group."""
-    return torch.distributed.get_world_size(
-        group=get_tensor_model_parallel_group())
+    if torch.distributed.is_initialized():
+        return torch.distributed.get_world_size(
+            group=get_tensor_model_parallel_group())
+    else:
+        return 1
 
 
 def get_pipeline_model_parallel_world_size():
@@ -117,13 +120,19 @@ def get_pipeline_model_parallel_world_size():
 
 def get_tensor_model_parallel_rank():
     """Return my rank for the tensor model parallel group."""
-    return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
+    if torch.distributed.is_initialized():
+        return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
+    else:
+        return 0
 
 
 def get_pipeline_model_parallel_rank():
     """Return my rank for the pipeline model parallel group."""
-    return torch.distributed.get_rank(
-        group=get_pipeline_model_parallel_group())
+    if torch.distributed.is_initialized():
+        return torch.distributed.get_rank(
+            group=get_pipeline_model_parallel_group())
+    else:
+        return 0
 
 
 def get_tensor_model_parallel_src_rank():
diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py
index 8698b15..72a3f30 100644
--- a/vllm/worker/worker.py
+++ b/vllm/worker/worker.py
@@ -45,6 +45,7 @@ class Worker:
         self.cache_engine = None
         self.cache_events = None
         self.gpu_cache = None
+        self.gpu_mem_pre_occupied = 0
 
     def init_model(self) -> None:
         # torch.distributed.all_reduce does not free the input tensor until
@@ -76,6 +77,8 @@ class Worker:
         set_random_seed(self.model_config.seed)
 
     def load_model(self):
+        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
+        self.gpu_mem_pre_occupied = total_gpu_memory - free_gpu_memory
         self.model_runner.load_model()
 
     @torch.inference_mode()
@@ -102,7 +105,7 @@ class Worker:
         cache_block_size = CacheEngine.get_cache_block_size(
             block_size, self.model_config, self.parallel_config)
         num_gpu_blocks = int(
-            (total_gpu_memory * gpu_memory_utilization - peak_memory) //
+            (total_gpu_memory * gpu_memory_utilization - peak_memory + self.gpu_mem_pre_occupied) //
             cache_block_size)
         num_cpu_blocks = int(cpu_swap_space // cache_block_size)
         num_gpu_blocks = max(num_gpu_blocks, 0)
@@ -179,6 +182,8 @@ def _init_distributed_environment(
             "distributed_init_method must be set if torch.distributed "
             "is not already initialized")
     else:
+        if parallel_config.world_size == 1:
+            return
         torch.distributed.init_process_group(
             backend="nccl",
             world_size=parallel_config.world_size,

Of course, the patch also includes a gpu_memory_utilization fix which helps you budget GPU memory properly when more than one model is loaded.
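With that fix applied, each engine can then be given its own share of GPU memory, roughly like this (the model names and fractions are only illustrative; tune them to your models and GPU):

from vllm import LLM

llm_a = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.4)
llm_b = LLM(model="facebook/opt-350m", gpu_memory_utilization=0.4)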

Snowdar — Dec 25 '23


Thanks for sharing. Unfortunately, the destroy_model_parallel / destroy_process_group approach above doesn't work for me. It seems to delete the reference to the placement group stored in vLLM, but the placement group still needs to be explicitly removed from Ray.

I'm sharing this alternative (hackish) solution that worked for me, in case anyone is having the same issue:

from ray.util import remove_placement_group
from vllm.model_executor.parallel_utils.parallel_state import _TENSOR_MODEL_PARALLEL_GROUP, destroy_model_parallel

# ... LLM.generate() logic ...

# This should destroy the Ray placement group
if _TENSOR_MODEL_PARALLEL_GROUP:
    remove_placement_group(_TENSOR_MODEL_PARALLEL_GROUP)
    destroy_model_parallel()

# Create another model / run generation

Thanks

marcosmacedo — Jan 06 '24