Running ray-llm 0.5.0 on g4dn.12xlarge instance
Has anyone had any success serving LLMs through the 0.5.0 Docker image?
I followed these steps:

```bash
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=/tmp/data -v $cache_dir:/home/user/data anyscale/ray-llm:0.5.0 bash
```
Inside the container, I reconfigured the model YAML to use the accelerator_type_T4 resource (see the sketch after these steps), then started Ray and ran the Serve app:
```bash
ray start --head --dashboard-host=0.0.0.0 --num-cpus 48 --num-gpus 4 --resources='{"accelerator_type_T4": 4}'
serve run ~/serve_configs/amazon--LightGPT.yaml
```
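
For reference, here is roughly what the accelerator-related parts of the model YAML look like after that change. This is a sketch from memory of the ray-llm 0.5.0 model config format, not the actual contents of amazon--LightGPT.yaml, so treat the field names as assumptions:

```yaml
# Sketch: swap the default accelerator resource for the g4dn's T4 GPUs.
deployment_config:
  ray_actor_options:
    resources:
      accelerator_type_T4: 0.01  # custom resource requested per replica
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    accelerator_type_T4: 0.01  # must match the resource passed to `ray start`
```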
It runs, but I get a warning:

> Deployment 'VLLMDeployment: amazon--LightGPT' in application 'ray-llm' has 2 replicas that have taken more than 30s to initialize. This may be caused by a slow init or reconfigure method.
From here, nothing happens. I've let it run for up to a couple of hours, and it just seems to hang at this point.
Any success working around these issues?
I'm using vLLM as the serving engine and run inference through Ray Serve. Here is a sample script: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
Then just wrap it as a Ray Serve deployment like:

```python
@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        ...
```
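
Filled out, that skeleton looks roughly like the sketch below. It follows vLLM's AsyncLLMEngine API and Ray Serve's FastAPI-ingress pattern; the /generate route, the request/response shape, and the model name are illustrative choices, not something taken from the linked script:

```python
from fastapi import FastAPI
from ray import serve
from starlette.requests import Request
from starlette.responses import JSONResponse
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        # kwargs are forwarded to vLLM's engine args,
        # e.g. model="amazon/LightGPT" (and dtype="half" on T4s).
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**kwargs))

    @app.post("/generate")
    async def generate(self, request: Request) -> JSONResponse:
        # Expect a JSON body like {"prompt": "...", "max_tokens": 64};
        # everything besides "prompt" is treated as a sampling parameter.
        body = await request.json()
        prompt = body.pop("prompt")
        sampling_params = SamplingParams(**body)

        # The engine streams partial outputs; keep only the final one.
        final_output = None
        async for output in self.engine.generate(prompt, sampling_params, random_uuid()):
            final_output = output
        return JSONResponse({"text": [o.text for o in final_output.outputs]})


# Illustrative entry point: start it with `serve run my_module:deployment`.
deployment = VLLMPredictDeployment.bind(model="amazon/LightGPT")
```

Once it's up, a request like `curl -X POST http://localhost:8000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello", "max_tokens": 32}'` should return the generated text.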
What does ray status say? If the custom resource was registered, accelerator_type_T4 should show up in the cluster's resource totals.