ray_vllm_inference
Ray Serve gets stuck when loading two or more applications
This is my .yaml configuration file:
# Serve config file
#
# For documentation see:
# https://docs.ray.io/en/latest/serve/production-guide/config.html

host: 0.0.0.0
port: 8000

applications:

- name: demo_app
  route_prefix: /a
  import_path: ray_vllm_inference.vllm_serve:deployment
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
      - ray_vllm_inference @ git+https://github.com//asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4

- name: demo_app2
  route_prefix: /b
  import_path: ray_vllm_inference.vllm_serve:deployment
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
      - ray_vllm_inference @ git+https://github.com//asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4
I run it with the command serve run config2.yaml, but the deployment process gets stuck and never completes. Here are the logs:
2024-01-11 12:58:28,970 INFO scripts.py:442 -- Running config file: 'config2.yaml'.
2024-01-11 12:58:30,870 INFO worker.py:1664 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2024-01-11 12:58:33,757 SUCC scripts.py:543 -- Submitted deploy config successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,752 controller 1450442 application_state.py:386 - Building application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,756 controller 1450442 application_state.py:386 - Building application 'demo_app2'.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,727 proxy 10.10.29.89 proxy.py:1072 - Proxy actor 4b0df404e3c5af4bd834d1ab01000000 starting on node b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,732 proxy 10.10.29.89 proxy.py:1257 - Starting HTTP server on node: b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495 listening on port 8000
(ProxyActor pid=1450530) INFO: Started server process [1450530]
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,180 controller 1450442 application_state.py:477 - Built application 'demo_app' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,182 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,284 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,302 controller 1450442 application_state.py:477 - Built application 'demo_app2' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,304 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app2'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,406 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app2'.
(ServeReplica:demo_app:VLLMInference pid=1468450) INFO 2024-01-11 12:58:45,015 VLLMInference demo_app#VLLMInference#WArOfC vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app2:VLLMInference pid=1468458) INFO 2024-01-11 12:58:45,021 VLLMInference demo_app2#VLLMInference#xOjgzS vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app:VLLMInference pid=1468450) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app:VLLMInference pid=1468450) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,292 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:demo_app2:VLLMInference pid=1468458) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app2:VLLMInference pid=1468458) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,494 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,363 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,566 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,441 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,645 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
Interestingly, when I disable the demo_app2 application by commenting it out in the config, the deployment proceeds without any issues. I have 8 GPUs on my server, which should be enough for the configuration above (each application requests num_gpus: 4 for its single replica, so the two applications need 8 GPUs in total).
I've also attempted to create my own deployment in Python, bypassing the ray_vllm_inference library, but I ran into the same problem. I noticed that the vLLM application seems to be using the wrong GPUs: when I logged the CUDA_VISIBLE_DEVICES variable in the initialization function, it showed 0,1,2,3, but according to nvidia-smi, vLLM was actually running on GPUs 4,5,6,7.
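For reference, here is a minimal sketch of the kind of check I did. This is not my actual deployment: the class name and logging are illustrative, and the model/engine setup is left out entirely; it only shows where I read CUDA_VISIBLE_DEVICES inside the replica.

import os
import logging

from ray import serve

logger = logging.getLogger("ray.serve")


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 4})
class GPUCheck:
    def __init__(self):
        # Log the GPUs Ray assigned to this replica so it can be compared
        # with what nvidia-smi reports once a model is actually loaded.
        logger.info("CUDA_VISIBLE_DEVICES=%s",
                    os.environ.get("CUDA_VISIBLE_DEVICES"))

    async def __call__(self, request) -> str:
        # Return the assignment so it can also be checked over HTTP.
        return os.environ.get("CUDA_VISIBLE_DEVICES", "")


deployment = GPUCheck.bind()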
In an attempt to troubleshoot, I created a custom deployment using the SDXL model (also as two applications). That worked perfectly: the model used exactly the GPUs specified in the CUDA_VISIBLE_DEVICES variable.
I've found the problem and posted it here.