ray-llm
Deploying RayLLM locally exits with code 0 even though the deployment is ready
Hi, I'm trying to deploy meta-llama--Llama-2-7b-chat-hf.yaml
using the instructions provided in the README. The deployment seems to work, but just as everything is about to become ready, the command exits without any error:
(base) ray@35cf69569a48:~/models/continuous_batching$ aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
[WARNING 2023-10-16 09:04:22,790] api.py: 382 DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
[INFO 2023-10-16 09:04:24,848] accelerator.py: 171 Failed to detect number of TPUs: [Errno 2] No such file or directory: '/dev/vfio'
2023-10-16 09:04:24,987 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[INFO 2023-10-16 09:04:26,208] api.py: 148 Nothing to shut down. There's no Serve application running on this Ray cluster.
[INFO 2023-10-16 09:04:26,269] deployment_base_client.py: 28 Initialized with base handles {'meta-llama/Llama-2-7b-chat-hf': <ray.serve.deployment.Application object at 0x7f1a8e5a94c0>}
/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/api.py:519: UserWarning: Specifying host and port in `serve.run` is deprecated and will be removed in a future version. To specify custom HTTP options, use `serve.start`.
warnings.warn(
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,523 http_proxy 172.17.0.2 http_proxy.py:1428 - Proxy actor 69fb321f9360031e80d6562c01000000 starting on node 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25.
[INFO 2023-10-16 09:04:28,555] api.py: 328 Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,530 http_proxy 172.17.0.2 http_proxy.py:1612 - Starting HTTP server on node: 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25 listening on port 8000
(HTTPProxyActor pid=22159) INFO: Started server process [22159]
(ServeController pid=22117) INFO 2023-10-16 09:04:28,689 controller 22117 deployment_state.py:1390 - Deploying new version of deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,690 controller 22117 deployment_state.py:1390 - Deploying new version of deployment Router in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,793 controller 22117 deployment_state.py:1679 - Adding 1 replica to deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,796 controller 22117 deployment_state.py:1679 - Adding 2 replicas to deployment Router in application 'router'.
(ServeReplica:router:Router pid=22202) [WARNING 2023-10-16 09:04:32,739] api.py: 382 DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,808] vllm_models.py: 201 Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040> PlacementGroupID(371dfe1112ca6705f22ac50c828201000000). {'placement_group_id': '371dfe1112ca6705f22ac50c828201000000', 'name': 'SERVE_REPLICA::router#VLLMDeployment:meta-llama--Llama-2-7b-chat-hf#mZlJZj', 'bundles': {0: {'CPU': 1.0}, 1: {'CPU': 4.0, 'GPU': 1.0}}, 'bundles_to_node_id': {0: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25', 1: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 1.814, 'scheduling_latency_ms': 1.728, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_models.py: 204 Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040>
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_node_initializer.py: 38 Starting initialize_node tasks on the workers and local node...
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:37,474] vllm_node_initializer.py: 53 Finished initialize_node tasks.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode=auto, revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:51 llm_engine.py:205] # GPU blocks: 1014, # CPU blocks: 512
[INFO 2023-10-16 09:04:53,741] client.py: 581 Deployment 'VLLMDeployment:meta-llama--Llama-2-7b-chat-hf:biUfsX' is ready. component=serve deployment=VLLMDeployment:meta-llama--Llama-2-7b-chat-hf
[INFO 2023-10-16 09:04:53,741] client.py: 581 Deployment 'Router:QHkGZE' is ready at `http://0.0.0.0:8000/`. component=serve deployment=Router
(pid=22359) [WARNING 2023-10-16 09:04:37,030] api.py: 382 DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(base) ray@35cf69569a48:~/models/continuous_batching$ echo $?
0
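The log above does say "Started detached Serve instance in namespace "serve"", so I assume the application might still be running in the background even after the CLI returns. Something like this (standard Ray / Ray Serve CLI commands, nothing RayLLM-specific) should show whether it is actually alive:

serve status   # reports the applications and deployments known to the detached Serve instance
ray status     # shows whether the local Ray cluster and its resource reservations are still up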
Note that I modified the config so it can run on my custom machine with 8 CPU cores, 32 GB of RAM, and an NVIDIA L4 GPU:
(base) ray@35cf69569a48:~/models/continuous_batching$ cat ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 1
    target_num_ongoing_requests_per_replica: 24
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      accelerator_type_a10: 0
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf
  hf_model_id: meta-llama/Llama-2-7b-chat-hf
  type: VLLMEngine
  engine_kwargs:
    trust_remote_code: true
    max_num_batched_tokens: 4096
    max_num_seqs: 64
    gpu_memory_utilization: 0.95
  max_total_tokens: 4096
  generation:
    prompt_format:
      system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
      assistant: " {instruction} </s><s> "
      trailing_assistant: " "
      user: "[INST] {system}{instruction} [/INST]"
      system_in_user: true
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_a10: 0
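For reference, once the Router reports ready at http://0.0.0.0:8000/ I would expect to be able to query it with something like the request below (route and payload assumed from the README's OpenAI-compatible API; the exact path may differ between versions):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'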
I also confirmed that my machine can run meta-llama/Llama-2-7b-chat-hf using pure vLLM, and RayLLM itself seems to confirm that the model loads, so why does it keep exiting? Am I doing anything wrong here?
Thank you for taking a look.
Hi @lamhoangtung, can you try using the `serve run` command instead? You can refer to the README here for example usage: https://github.com/ray-project/ray-llm
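For example, something along these lines (file name assumed based on the serve_configs directory in the repo; use the serve config that points at your model YAML):

# from a checkout of ray-project/ray-llm
serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml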