
Error when performing `benchmarks/benchmark_latency.py` using multiple GPUs on a single node

Open flexchar opened this issue 1 year ago • 2 comments

Hi! I am evaluating the latency of various servers and I hit a roadblock.

Steps I took to prepare

echo "export HF_HUB_ENABLE_HF_TRANSFER=1" >> ~/.bashrc
echo "export PIP_DISABLE_PIP_VERSION_CHECK=1" >> ~/.bashrc
source ~/.bashrc
export HUGGING_FACE_HUB_TOKEN=ht----
pip install vllm hf_transfer
pip install --upgrade torch # had errors on some servers

git clone https://github.com/vllm-project/vllm.git
cd vllm
# inject max_model_len=8192 into the LLM(...) call; the default 32k context would otherwise cause OOM (see the sanity check after the download step)
sed -i '24i\ max_model_len=8192' benchmarks/benchmark_latency.py


export MODEL="ehartford/dolphin-2.1-mistral-7b"
huggingface-cli download $MODEL
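
(Sanity check for the sed patch above: a quick grep confirms the injected line landed inside the LLM(...) constructor. The "around line 24" expectation is an assumption about this particular checkout.)

grep -n "max_model_len" benchmarks/benchmark_latency.py
# expect a single hit around line 24, inside the LLM(...) call in main()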

Single GPU benchmark WORKS

python benchmarks/benchmark_latency.py --model $MODEL \
    --input-len 1000 --output-len 100 \
    --n 2 --batch-size 1 --num-iters 10

Multiple GPU benchmark ERROR

pip install ray
ray start --head
ray start --address='192.168.122.113:6379'
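
(For anyone reproducing: before launching the benchmark, ray status should show a single node with both GPUs registered. This assumes a stock Ray install; section names in the output may differ between Ray versions.)

ray status
# expect one node and 2x GPU in the "Node status" / "Resources" summary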

python benchmarks/benchmark_latency.py --model $MODEL \
    --input-len 1000 --output-len 100 \
    --n 2 --batch-size 1 --num-iters 10 -tp 2
CLI Error output
python benchmarks/benchmark_latency.py --model $MODEL \
>     --input-len 1000 --output-len 100 \
>     --n 2 --batch-size 1 --num-iters 10 -tp 2
Namespace(model='ehartford/dolphin-2.1-mistral-7b', tokenizer=None, quantization=None, tensor_parallel_size=2, input_len=1000, output_len=100, batch_size=1, n=2, use_beam_search=False, num_iters=10, trust_remote_code=False, dtype='auto', profile=False)
2023-12-05 15:34:27,668	INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 192.168.122.113:6379...
2023-12-05 15:34:27,679	INFO worker.py:1673 -- Connected to Ray cluster.
INFO 12-05 15:34:27 llm_engine.py:73] Initializing an LLM engine with config: model='ehartford/dolphin-2.1-mistral-7b', tokenizer='ehartford/dolphin-2.1-mistral-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
WARNING 12-05 15:34:27 config.py:275] Possibly too large swap space. 8.00 GiB out of the 15.62 GiB total CPU memory is allocated for the swap space.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/user/vllm/benchmarks/benchmark_latency.py", line 112, in <module>
    main(args)
  File "/home/user/vllm/benchmarks/benchmark_latency.py", line 17, in main
    llm = LLM(
          ^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 107, in __init__
    self._init_workers_ray(placement_group)
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 177, in _init_workers_ray
    init_torch_dist_process_group(self.workers, backend="nccl")
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/air/util/torch_dist.py", line 119, in init_torch_dist_process_group
    node_and_gpu_ids = ray.get(
                       ^^^^^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The actor with name RayWorkerVllm failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 6, in <module>
    from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
  File "/home/user/vllm/vllm/config.py", line 9, in <module>
    from vllm.utils import get_cpu_memory
  File "/home/user/vllm/vllm/utils.py", line 8, in <module>
    from vllm._C import cuda_utils
ModuleNotFoundError: No module named 'vllm._C'
(TemporaryActor pid=9913/9914) [the same RayWorkerVllm import failure and "ModuleNotFoundError: No module named 'vllm._C'" traceback is repeated by both worker actors; Ray deduplicates these logs by default]

Multiple GPU inference WORKS

python -m vllm.entrypoints.openai.api_server \
     --model $MODEL \
     --host 0.0.0.0 --port 8888 \
     --max-model-len 8192 -tp 2 \
     --max-parallel-loading-workers 1
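
(For reference, a minimal request against the OpenAI-compatible completions route confirms the server is actually serving; port 8888 and the model name match the command above:)

curl http://localhost:8888/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ehartford/dolphin-2.1-mistral-7b", "prompt": "San Francisco is a", "max_tokens": 16}'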

flexchar avatar Dec 05 '23 15:12 flexchar

I get the same problem...

xiuxin121 avatar Dec 28 '23 06:12 xiuxin121

I don't think you need to call ray start to use tensor parallel anymore. Are you still experiencing this issue?
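
(Concretely, something like the original command with no ray start steps at all, since vLLM brings up Ray itself when -tp > 1 on a single node; flags copied from the report above:)

python benchmarks/benchmark_latency.py --model ehartford/dolphin-2.1-mistral-7b \
    --input-len 1000 --output-len 100 \
    --n 2 --batch-size 1 --num-iters 10 -tp 2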

hmellor avatar Mar 28 '24 13:03 hmellor

@flexchar Are you still having this issue? I suspect pip install --upgrade torch overwrote the pinned torch version - that is likely the cause.
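
(One way to check which torch actually ended up installed versus what vllm requires; the exact pinned version depends on the vllm release:)

pip show vllm torch | grep -E "^(Name|Version|Requires)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"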

michaelfeil avatar Apr 04 '24 04:04 michaelfeil

I'll close this as stale for now

hmellor avatar Apr 04 '24 07:04 hmellor

I haven't benchmarked since...


flexchar avatar Apr 04 '24 08:04 flexchar