Error when performing `benchmarks/benchmark_latency.py` using multiple GPUs on a single node
Hi! I am evaluating the latency of various servers and I hit a roadblock.
Steps I took to prepare
echo "export HF_HUB_ENABLE_HF_TRANSFER=1" >> ~/.bashrc
echo "export PIP_DISABLE_PIP_VERSION_CHECK=1" >> ~/.bashrc
source ~/.bashrc
export HUGGING_FACE_HUB_TOKEN=ht----
pip install vllm hf_transfer
pip install --upgrade torch # had errors on some servers
git clone https://github.com/vllm-project/vllm.git
cd vllm
# cap max_model_len at 8192; the hard-coded 32k default would otherwise cause OOM (see the check below)
sed -i '24i\ max_model_len=8192' benchmarks/benchmark_latency.py
export MODEL="ehartford/dolphin-2.1-mistral-7b"
huggingface-cli download $MODEL
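Since the sed command above patches a hard-coded line number, it is worth sanity-checking that max_model_len=8192 actually landed inside the LLM(...) call before running anything (the line range below is only a guess and will shift between vllm versions):
# print the area around the insertion point; adjust the range for your checkout
sed -n '20,30p' benchmarks/benchmark_latency.py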
Single GPU benchmark WORKS
python benchmarks/benchmark_latency.py --model $MODEL \
--input-len 1000 --output-len 100 \
--n 2 --batch-size 1 --num-iters 10
Multiple GPU benchmark ERROR
pip install ray
ray start --head
ray start --address='192.168.122.113:6379'
python benchmarks/benchmark_latency.py --model $MODEL \
--input-len 1000 --output-len 100 \
--n 2 --batch-size 1 --num-iters 10 -tp 2
CLI Error output
python benchmarks/benchmark_latency.py --model $MODEL \
> --input-len 1000 --output-len 100 \
> --n 2 --batch-size 1 --num-iters 10 -tp 2
Namespace(model='ehartford/dolphin-2.1-mistral-7b', tokenizer=None, quantization=None, tensor_parallel_size=2, input_len=1000, output_len=100, batch_size=1, n=2, use_beam_search=False, num_iters=10, trust_remote_code=False, dtype='auto', profile=False)
2023-12-05 15:34:27,668 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 192.168.122.113:6379...
2023-12-05 15:34:27,679 INFO worker.py:1673 -- Connected to Ray cluster.
INFO 12-05 15:34:27 llm_engine.py:73] Initializing an LLM engine with config: model='ehartford/dolphin-2.1-mistral-7b', tokenizer='ehartford/dolphin-2.1-mistral-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
WARNING 12-05 15:34:27 config.py:275] Possibly too large swap space. 8.00 GiB out of the 15.62 GiB total CPU memory is allocated for the swap space.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "/home/user/vllm/benchmarks/benchmark_latency.py", line 112, in <module>
main(args)
File "/home/user/vllm/benchmarks/benchmark_latency.py", line 17, in main
llm = LLM(
^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
engine = cls(*engine_configs,
^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 107, in __init__
self._init_workers_ray(placement_group)
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 177, in _init_workers_ray
init_torch_dist_process_group(self.workers, backend="nccl")
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/air/util/torch_dist.py", line 119, in init_torch_dist_process_group
node_and_gpu_ids = ray.get(
^^^^^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/tensorml/lib/python3.11/site-packages/ray/_private/worker.py", line 2565, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The actor with name RayWorkerVllm failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:
ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/vllm/vllm/__init__.py", line 3, in <module>
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
File "/home/user/vllm/vllm/engine/arg_utils.py", line 6, in <module>
from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
File "/home/user/vllm/vllm/config.py", line 9, in <module>
from vllm.utils import get_cpu_memory
File "/home/user/vllm/vllm/utils.py", line 8, in <module>
from vllm._C import cuda_utils
ModuleNotFoundError: No module named 'vllm._C'
(TemporaryActor pid=9913) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
(TemporaryActor pid=9913) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TemporaryActor pid=9913) RuntimeError: The actor with name RayWorkerVllm failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:
(TemporaryActor pid=9913)
(TemporaryActor pid=9913) ray::RayWorkerVllm.__init__() (pid=9913, ip=192.168.122.113, actor_id=0eaee3d144158c523262a58603000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fab75b1f010>)
(TemporaryActor pid=9913) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TemporaryActor pid=9913) File "/home/user/vllm/vllm/__init__.py", line 3, in <module>
(TemporaryActor pid=9913) from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
(TemporaryActor pid=9913) File "/home/user/vllm/vllm/engine/arg_utils.py", line 6, in <module>
(TemporaryActor pid=9913) from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
(TemporaryActor pid=9913) File "/home/user/vllm/vllm/config.py", line 9, in <module>
(TemporaryActor pid=9913) from vllm.utils import get_cpu_memory
(TemporaryActor pid=9913) File "/home/user/vllm/vllm/utils.py", line 8, in <module>
(TemporaryActor pid=9913) from vllm._C import cuda_utils
(TemporaryActor pid=9913) ModuleNotFoundError: No module named 'vllm._C'
(TemporaryActor pid=9914) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RayWorkerVllm.__init__() (pid=9914, ip=192.168.122.113, actor_id=19c14e5288831e567afe8c5e03000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f6c61a4e350>) [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(TemporaryActor pid=9914) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(TemporaryActor pid=9914) RuntimeError: The actor with name RayWorkerVllm failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment: [repeated 2x across cluster]
(TemporaryActor pid=9914) [repeated 2x across cluster]
(TemporaryActor pid=9914) ray::RayWorkerVllm.__init__() (pid=9914, ip=192.168.122.113, actor_id=19c14e5288831e567afe8c5e03000000, repr=<vllm.engine.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f6c61a4e350>) [repeated 2x across cluster]
(TemporaryActor pid=9914) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(TemporaryActor pid=9914) File "/home/user/vllm/vllm/__init__.py", line 3, in <module> [repeated 2x across cluster]
(TemporaryActor pid=9914) from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs [repeated 2x across cluster]
(TemporaryActor pid=9914) File "/home/user/vllm/vllm/engine/arg_utils.py", line 6, in <module> [repeated 2x across cluster]
(TemporaryActor pid=9914) from vllm.config import (CacheConfig, ModelConfig, ParallelConfig, [repeated 2x across cluster]
(TemporaryActor pid=9914) File "/home/user/vllm/vllm/config.py", line 9, in <module> [repeated 2x across cluster]
(TemporaryActor pid=9914) from vllm.utils import get_cpu_memory [repeated 2x across cluster]
(TemporaryActor pid=9914) File "/home/user/vllm/vllm/utils.py", line 8, in <module> [repeated 2x across cluster]
(TemporaryActor pid=9914) from vllm._C import cuda_utils [repeated 2x across cluster]
(TemporaryActor pid=9914) ModuleNotFoundError: No module named 'vllm._C' [repeated 2x across cluster]
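Worth noting: the worker-side frames import vllm from /home/user/vllm/vllm/, i.e. the git checkout, which contains no compiled vllm._C extension, while the driver-side frames use the copy in site-packages. A quick way to check whether the checkout shadows the installed wheel when Python starts in that directory (paths taken from the log above; this is only a guess at the cause):
cd ~/vllm   # the directory the benchmark and ray were launched from
# failing with "No module named 'vllm._C'" (or printing a path under ~/vllm)
# means the source tree is picked up instead of the installed package
python -c "import vllm; print(vllm.__file__)"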
Multiple GPU inference WORKS
python -m vllm.entrypoints.openai.api_server \
--model $MODEL \
--host 0.0.0.0 --port 8888 \
--max-model-len 8192 -tp 2 \
--max-parallel-loading-workers 1
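A quick smoke test against the OpenAI-compatible server started above (port 8888 as configured; prompt and token count are arbitrary):
curl http://localhost:8888/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ehartford/dolphin-2.1-mistral-7b", "prompt": "Hello, my name is", "max_tokens": 16}'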
I get the same problem...
I don't think you need to call ray start to use tensor parallel anymore. Are you still experiencing this issue?
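If so, something like the following should be enough, letting vLLM bring up Ray on its own (the benchmark command mirrors the failing one above):
ray stop   # tear down the manually started head node
python benchmarks/benchmark_latency.py --model $MODEL \
    --input-len 1000 --output-len 100 \
    --n 2 --batch-size 1 --num-iters 10 -tp 2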
@flexchar Are you still having this issue? I suspect pip install --upgrade torch will overwrite the pinned version; this is likely the cause.
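If the torch upgrade is indeed the problem, forcing a reinstall of vllm should let pip bring torch back to the version the release pins (assuming an exact pin; versions depend on your vllm release):
# show which torch build is currently installed
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# reinstall vllm so pip restores its pinned torch dependency
pip install --force-reinstall vllm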
I'll close this as stale for now
I haven't benchmarked since.