Tensor parallelism on a Ray cluster
I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model across more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
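For concreteness, the setup being described presumably looks something like the sketch below (the model name and the explicit ray.init call are illustrative assumptions, not taken from the report):

import ray
from vllm import LLM

# Attach to the running Ray cluster (vLLM can also start/attach to Ray on its own).
ray.init(address="auto")

# With tensor_parallel_size=1 this loads fine; setting it to 2 on the cluster is what fails.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder Llama checkpoint
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)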
Solved: my particular issue (not necessarily the OP's) was that the RayCluster running locally on the single node (because we're not doing distributed inference) didn't have enough shared memory. Mounting a larger tmpfs at /dev/shm fixed it:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: myimage
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: "10.24Gi"
Don't know Ray well enough to understand what this does, lol.
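For context: emptyDir with medium: Memory mounts a tmpfs at /dev/shm, which is the shared memory that Ray's object store and NCCL use for inter-process communication; the default container /dev/shm of 64 MiB is far too small for tensor-parallel inference. A quick, vLLM-agnostic way to check the effective size from inside the container (a small sketch):

import shutil

# /dev/shm is the shared-memory tmpfs that Ray and NCCL rely on.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB")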
I am unaffiliated with the OP, but I believe we are having the same issue. We're using Kubernetes to deploy a model on a single g4.12xlarge instance (4 GPUs). We cannot use a newer model class for various reasons. To troubleshoot, I've chosen a small model that runs easily on a single GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        workload-type: gpu
      containers:
        - name: vllm-container
          image: lmisam/vllm-openai:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
              name: api
          resources:
            limits:
              nvidia.com/gpu: "4"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3"  # ensures the container app can see all 4 GPUs
          args: ["--model", "TheBloke/OpenHermes-2-Mistral-7B-AWQ",
                 "--quantization", "awq",
                 "--dtype", "float16",
                 "--max-model-len", "4096",
                 "--tensor-parallel-size", "1"]  # <---- Important
This is overkill, but as you can see we're making 4 GPUs available to the container, despite only running on one of them. I've also confirmed, by shelling into the container and running PyTorch commands, that it does have 4 GPUs accessible.
When --tensor-parallel-size is 1, or the flag is not included, the model works just fine. When the flag is set to 2 or more, we get a long stack trace; the relevant portion is shown below.
Do I have to manually start the Ray cluster, or set any other environment variables, so that it is up and healthy when the Docker container starts? Or does vLLM abstract this entirely and there's nothing else to do on our end?
self._init_workers_ray(placement_group)
File "/workspace/vllm/engine/llm_engine.py", line 181, in _init_workers_ray
self._run_workers(
File "/workspace/vllm/engine/llm_engine.py", line 704, in _run_workers
all_outputs = ray.get(all_outputs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2565, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayWorker
actor_id: 1f4e0ddffc9942a1e34140a601000000
pid: 1914
namespace: 0c0e14ce-0a35-4190-ad47-f0a1959e7fe4
ip: 172.31.15.88
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,238 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffedad1d4643afe66a344639bf01000000 Worker ID: 9cd611fa721b93a5d424da4383573ab345580a0a19f9301512aa384f Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 38939 Worker PID: 1916 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,272 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff45041ffe54799af4388ab20401000000 Worker ID: 708f3407d01c3576315da0569ada9fcd984f754e43417718e04ca93b Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 46579 Worker PID: 1915 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,316 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff1252d098d6462eb8cd537bd601000000 Worker ID: acea4bee761db0a4bff1adb3cc43d1e3ba1793d3017c3fc22e18e6d7 Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 45813 Worker PID: 1913 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1916) *** SIGBUS received at time=1699285494 on cpu 2 *** [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorker pid=1916) PC: @ 0x7fcfb18c6c3a (unknown) (unknown) [repeated 3x across cluster]
(RayWorker pid=1916) @ 0x7fcfb1759520 3392 (unknown) [repeated 3x across cluster]
(RayWorker pid=1916) @ 0x32702d6c63636e2f (unknown) (unknown) [repeated 3x across cluster]
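One way to answer the "is Ray up and healthy?" question above is to open a Python shell inside the container and look at what Ray itself reports; this is a hedged diagnostic sketch, not an official vLLM step:

import ray

# Start (or attach to) a local Ray instance, roughly what vLLM does internally
# when tensor_parallel_size > 1 and no cluster is running yet.
ray.init(ignore_reinit_error=True)

# If the GPU or memory counts here look wrong, the Ray workers will die before
# the model loads, consistent with the worker exits in the trace above.
print(ray.cluster_resources())
print(ray.available_resources())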
I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model across more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
Same, any solution please?
Hit the same issue
https://github.com/vllm-project/vllm/issues/1058#issuecomment-1846600442 could be related
Here is the finding for my case: when I submit a remote job, it claims GPUs. For the following code, it takes 1 GPU:

@ray.remote(num_gpus=1)
class MyClass:
    ...

When vLLM runs tensor parallelism, it creates its own GPU workers in the cluster. However, the GPU claimed by the remote actor is unavailable to them, so the job eventually times out. Is there a way to make vLLM use the GPUs already assigned to the remote job?
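If I understand the description above correctly, the conflict looks roughly like this (a hedged sketch; the class and model names are made up):

import ray
from vllm import LLM

@ray.remote(num_gpus=1)  # the outer actor claims one of the node's GPUs from Ray
class MyWorker:
    def __init__(self):
        # vLLM now asks Ray for additional GPU workers for tensor parallelism,
        # but the GPU booked for this actor is not available to them, so the
        # placement may never be satisfied and startup eventually times out.
        self.llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

    def generate(self, prompt):
        return self.llm.generate([prompt])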
It’s the same as my finding in https://github.com/vllm-project/vllm/issues/1058#issuecomment-1846600442. I used custom resources to work around it. Ideally vLLM should have a way to pass in already assigned logical resources.
I am also running into the same issue on Red Hat OpenShift.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2
llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    trust_remote_code=True,
    max_new_tokens=50,
    temperature=0.1,
    tensor_parallel_size=2,
)
Startup hangs here:
INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
I found issue 31897 over on the Ray repo that looks similar: https://github.com/ray-project/ray/issues/31897
I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works. It simply needed two things done:
- I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:
  ray.init(address=ray_address, ignore_reinit_error=True)
  and modify it to:
  ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)
  (An alternative sketch that avoids patching site-packages follows after the package list below.)
- Pay very special attention to your environment, including Python version and installed libraries. Here is what I'm running now, and it's working:
Package Version
adal 1.2.7 aiohttp 3.8.6 aiohttp-cors 0.7.0 aioprometheus 23.12.0 aiorwlock 1.3.0 aiosignal 1.3.1 annotated-types 0.6.0 anyio 3.7.1 applicationinsights 0.11.10 archspec 0.2.1 argcomplete 1.12.3 async-timeout 4.0.3 attrs 21.4.0 azure-cli-core 2.40.0 azure-cli-telemetry 1.0.8 azure-common 1.1.28 azure-core 1.29.6 azure-identity 1.10.0 azure-mgmt-compute 23.1.0 azure-mgmt-core 1.4.0 azure-mgmt-network 19.0.0 azure-mgmt-resource 20.0.0 backoff 1.10.0 bcrypt 4.1.2 blessed 1.20.0 boltons 23.0.0 boto3 1.26.76 botocore 1.29.165 Brotli 1.0.9 cachetools 5.3.2 certifi 2023.11.17 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 2.2.0 colorful 0.5.5 commonmark 0.9.1 conda 23.11.0 conda-content-trust 0.2.0 conda-libmamba-solver 23.12.0 conda-package-handling 2.2.0 conda_package_streaming 0.9.0 cryptography 38.0.1 Cython 0.29.32 dataclasses-json 0.6.3 distlib 0.3.7 distro 1.8.0 dm-tree 0.1.8 exceptiongroup 1.2.0 Farama-Notifications 0.0.4 fastapi 0.104.0 filelock 3.13.1 flatbuffers 23.5.26 frozenlist 1.4.0 fsspec 2023.5.0 google-api-core 2.14.0 google-api-python-client 1.7.8 google-auth 2.23.4 google-auth-httplib2 0.2.0 google-oauth 1.0.1 googleapis-common-protos 1.61.0 gptcache 0.1.43 gpustat 1.1.1 greenlet 3.0.3 grpcio 1.59.3 gymnasium 0.28.1 h11 0.14.0 httpcore 1.0.2 httplib2 0.22.0 httptools 0.6.1 httpx 0.26.0 huggingface-hub 0.20.3 humanfriendly 10.0 idna 3.6 imageio 2.31.1 isodate 0.6.1 jax-jumpy 1.0.0 Jinja2 3.1.3 jmespath 1.0.1 jsonpatch 1.33 jsonpointer 2.1 jsonschema 4.17.3 knack 0.10.1 langchain 0.1.4 langchain-community 0.0.16 langchain-core 0.1.16 langsmith 0.0.83 lazy_loader 0.3 libmambapy 1.5.3 lz4 4.3.2 MarkupSafe 2.1.4 marshmallow 3.20.2 menuinst 2.0.1 mpmath 1.3.0 msal 1.18.0b1 msal-extensions 1.0.0 msgpack 1.0.7 msrest 0.7.1 msrestazure 0.6.4 multidict 6.0.4 mypy-extensions 1.0.0 networkx 3.1 ninja 1.11.1.1 numpy 1.26.3 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.535.133 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.10.0 opencensus 0.11.3 opencensus-context 0.1.3 opentelemetry-api 1.1.0 opentelemetry-exporter-otlp 1.1.0 opentelemetry-exporter-otlp-proto-grpc 1.1.0 opentelemetry-proto 1.1.0 opentelemetry-sdk 1.1.0 opentelemetry-semantic-conventions 0.20b0 orjson 3.9.12 packaging 23.2 pandas 1.5.3 paramiko 2.12.0 Pillow 9.2.0 pip 23.3.1 pkginfo 1.9.6 platformdirs 3.11.0 pluggy 1.0.0 portalocker 2.8.2 prometheus-client 0.19.0 protobuf 3.19.6 psutil 5.9.6 py-spy 0.3.14 pyarrow 12.0.1 pyasn1 0.5.1 pyasn1-modules 0.3.0 pycosat 0.6.6 pycparser 2.21 pydantic 1.10.13 pydantic_core 2.14.1 Pygments 2.13.0 PyJWT 2.8.0 PyNaCl 1.5.0 pyOpenSSL 22.1.0 pyparsing 3.1.1 pyrsistent 0.20.0 PySocks 1.7.1 python-dateutil 2.8.2 python-dotenv 1.0.0 pytz 2022.7.1 PyWavelets 1.4.1 PyYAML 6.0.1 quantile-python 1.1 ray 2.9.1 ray-cpp 2.9.1 redis 3.5.3 regex 2023.12.25 requests 2.31.0 requests-oauthlib 1.3.1 rich 12.6.0 rsa 4.7.2 ruamel.yaml 0.17.21 ruamel.yaml.clib 0.2.6 s3transfer 0.6.2 safetensors 0.4.2 scikit-image 0.21.0 scipy 1.10.1 sentencepiece 0.1.99 setuptools 68.2.2 six 1.16.0 smart-open 6.2.0 sniffio 1.3.0 SQLAlchemy 2.0.25 starlette 0.27.0 sympy 1.12 tabulate 0.9.0 tenacity 8.2.3 tensorboardX 2.6 tifffile 2023.7.10 tokenizers 0.15.1 torch 2.1.2 tqdm 4.65.0 transformers 4.37.1 triton 2.1.0 
truststore 0.8.0 typer 0.9.0 typing_extensions 4.8.0 typing-inspect 0.9.0 uritemplate 3.0.1 urllib3 1.26.18 uvicorn 0.22.0 uvloop 0.19.0 virtualenv 20.21.0 vllm 0.2.7 watchfiles 0.19.0 wcwidth 0.2.12 websockets 11.0.3 wheel 0.41.2 xformers 0.0.23.post1 yarl 1.9.3 zstandard 0.19.0
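If you would rather not patch site-packages, an untested variant of the same idea is to initialize Ray yourself with explicit resources before constructing the engine; this assumes that, because ray_utils.py calls ray.init(..., ignore_reinit_error=True), the already-initialized instance is reused:

import ray
from vllm import LLM

# Initialize Ray first with the resources the vLLM workers should see.
ray.init(num_gpus=2, num_cpus=8)

# vLLM's own ray.init(..., ignore_reinit_error=True) should then reuse this instance.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # placeholder model
    tensor_parallel_size=2,
)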
same issue.
I highly suggest you guys use KubeRay: launch a Ray cluster and submit the vLLM workload to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.
I highly suggest you guys use KubeRay: launch a Ray cluster and submit the vLLM workload to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.
How? Can you post a minimal example, please?
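Not a full KubeRay manifest, but once a KubeRay cluster is up and its dashboard port (8265) is reachable, the vLLM workload can be submitted to it as a Ray job from any client. In this sketch the service address and the entrypoint script are placeholders:

from ray.job_submission import JobSubmissionClient

# Address of the KubeRay head's dashboard service (placeholder name/namespace;
# check `kubectl get svc`, or port-forward 8265 locally).
client = JobSubmissionClient("http://raycluster-head-svc.default.svc.cluster.local:8265")

job_id = client.submit_job(
    # serve_vllm.py is a placeholder script that builds vllm.LLM(..., tensor_parallel_size=2)
    # or launches the OpenAI-compatible server.
    entrypoint="python serve_vllm.py",
    runtime_env={"pip": ["vllm"]},
)
print("submitted:", job_id)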
It was a bug in NVIDIA CUDA driver 545. I upgraded to the 550 beta driver (currently stable 550) and that solved it. (I am DN-Dev00)
I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works. It simply needed two things done:
- I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:
  ray.init(address=ray_address, ignore_reinit_error=True)
  and modify it to:
  ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)
This actually solved my problem: running vLLM with TP via Ray within a container provisioned via OpenShift. I can share more details if needed. Thanks @ernestol0817!
Can you please reply to https://github.com/vllm-project/vllm/issues/4462 and https://github.com/vllm-project/vllm/issues/5360? Some OpenShift users are running into this there.
@nelsonspbr Can you please post your example of vLLM with TP via Ray within a container provisioned via OpenShift? I'm really interested! Also, are you creating a Ray cluster with the vLLM image in one of the Ray workers in the workerGroupSpecs?