ray_vllm_inference
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
So I've got a Kubernetes cluster with the KubeRay Operator installed and a RayCluster created, but how do I then write a manifest for Ray Serve to serve, say, a Llama 2 model or... (see the sketch below)
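Not an official guide from this project, but a minimal sketch of what a KubeRay `RayService` manifest could look like, reusing the `ray_vllm_inference.vllm_serve:deployment` import path from the config shown further down. The image tags, Ray version, model id, and the `model` argument name are placeholders/assumptions and need to be checked against the repo and your cluster:

```yaml
# Sketch of a KubeRay RayService manifest (field names follow the RayService CRD).
# Image tags, rayVersion, model id and the `model` arg are assumptions.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama2-vllm
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: ray_vllm_inference.vllm_serve:deployment
        args:
          model: meta-llama/Llama-2-7b-chat-hf   # placeholder; depends on how the deployment is parameterized
  rayClusterConfig:
    rayVersion: "2.9.0"                          # placeholder Ray version
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0        # placeholder image tag
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.9.0-gpu   # placeholder GPU image
                resources:
                  limits:
                    nvidia.com/gpu: 1
```

Applying this manifest lets the operator manage both the Ray cluster and the Serve application; if you want to keep your existing RayCluster instead, you can submit the same `serveConfigV2` contents as a plain Serve config against that cluster.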
Your tutorial uses an online model download, but some environments are offline. Is there a deployment guide for offline LLM serving? (see the sketch below)
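There is no offline guide in this section, but one common approach is to pre-download the weights, mount them into the containers, and point the Serve application at the local directory while forcing Hugging Face libraries into offline mode. A sketch, where the `/models/...` path and the `model` argument name are assumptions (check how `vllm_serve:deployment` actually takes its model parameter):

```yaml
# Sketch: offline serving from a pre-downloaded model directory.
applications:
  - name: llm
    route_prefix: /
    import_path: ray_vllm_inference.vllm_serve:deployment
    runtime_env:
      env_vars:
        HF_HUB_OFFLINE: "1"          # stop huggingface_hub from hitting the network
        TRANSFORMERS_OFFLINE: "1"    # same for transformers
    args:
      model: /models/Llama-2-7b-chat-hf   # local directory with weights + tokenizer (placeholder path)
```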
This is my `.yaml` configuration file:

```yaml
# Serve config file
#
# For documentation see:
# https://docs.ray.io/en/latest/serve/production-guide/config.html

host: 0.0.0.0
port: 8000

applications:
- name: demo_app
  route_prefix: /a
  import_path: ray_vllm_inference.vllm_serve:deployment
  ...
```
I want to run a model service on multiple GPUs without tensor parallelism (TP). (see the sketch below)
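One way to use several GPUs without TP is data parallelism: run multiple Serve replicas of the deployment, each pinned to a single GPU, and let Serve load-balance requests across them. A sketch of the per-deployment override in the Serve config; the deployment name `VLLMDeployment` is a guess and must match the `@serve.deployment` class actually defined in `vllm_serve.py`:

```yaml
# Sketch: N independent replicas, one GPU each, no tensor parallelism.
applications:
- name: demo_app
  route_prefix: /a
  import_path: ray_vllm_inference.vllm_serve:deployment
  deployments:
  - name: VLLMDeployment       # placeholder; use the real deployment class name
    num_replicas: 4            # one replica per GPU
    ray_actor_options:
      num_gpus: 1              # pin each replica to a single GPU
```

Note that each replica loads its own full copy of the model, so this only works when the model fits on one GPU.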