kubernetes-engine-samples

Add RayService vLLM TPU Inference script

Open ryanaoleary opened this issue 1 year ago • 4 comments

Description

This PR adds a simple inference script for a Ray multi-host TPU example serving Meta-Llama-3-70B. Like the other scripts in the /llm/ folder, serve_tpu.py builds a Ray Serve deployment for vLLM, which can then be queried with text prompts to generate output. The script will be used as part of a tutorial in the GKE and Ray docs.
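
A minimal sketch of the pattern the script follows is below: a Ray Serve deployment wrapping a vLLM engine behind a text-generation endpoint. The class name, endpoint path, and argument values are illustrative assumptions, not the actual serve_tpu.py.

```python
# Illustrative sketch only -- not the actual serve_tpu.py. The class name,
# endpoint path, and argument values are assumptions for the example.
from fastapi import FastAPI
from ray import serve
from vllm import LLM, SamplingParams

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self, model: str, tensor_parallel_size: int):
        # tensor_parallel_size > 1 shards the weights across the TPU chips
        # in the slice.
        self.llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)

    @app.post("/generate")
    def generate(self, prompt: str) -> str:
        outputs = self.llm.generate(
            [prompt], SamplingParams(temperature=0.7, max_tokens=256)
        )
        return outputs[0].outputs[0].text


# Illustrative binding; the real script reads the model name and parallelism
# settings from environment variables set in the RayService manifest.
deployment = VLLMDeployment.bind(
    model="meta-llama/Meta-Llama-3-70B", tensor_parallel_size=8
)
```

Once the RayService is running, the endpoint can be queried with plain HTTP POSTs carrying a text prompt.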

Tasks

  • [x] The contributing guide has been read and followed.
  • [x] The samples added / modified have been fully tested.
  • [x] Workflow files have been added / modified, if applicable.
  • [x] Region tags have been properly added, if new samples.
  • [x] All dependencies are set to up-to-date versions, as applicable.
  • [ ] Merge this pull-request for me once it is approved.

ryanaoleary avatar Sep 25 '24 00:09 ryanaoleary

Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

andrewsykim avatar Sep 25 '24 15:09 andrewsykim

Here is the summary of changes.

You are about to add 4 region tags.

This comment is generated by snippet-bot. If you find problems with this result, please file an issue at: https://github.com/googleapis/repo-automation-bots/issues. To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • [ ] Refresh this comment

snippet-bot[bot] avatar Sep 27 '24 09:09 snippet-bot[bot]

> Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

Yeah, that sounds good. I'm still testing the 405B RayService, but I added the 8B and 70B manifests in fe6440c; we can then use envsubst to substitute the image variable.

ryanaoleary avatar Sep 27 '24 10:09 ryanaoleary

I've tried running Llama-3.1-405B with TPU slice sizes up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

  1. As slice sizes grow, the time vLLM needs for initialization and memory profiling increases dramatically.
  2. Attempting to run inference with smaller topologies than the aforementioned slice sizes leads to errors like RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 20.44G of 15.75G hbm. Exceeded hbm capacity by 4.70G., since v5e and v4 chips only have 16 GiB and 32 GiB of HBM respectively. The relatively small per-chip HBM capacity (compared to GPUs) means we need much larger slice sizes to fit the sharded weights (see the back-of-envelope estimate after this list).
  3. Even when loading a model that will be sharded (i.e. with tensor-parallelism > 1), vLLM still downloads the entire checkpoint on each worker and only afterwards writes that worker's shard to new files. This means that larger slice sizes require a very large amount of total disk space when loading large models.
  4. Larger multi-host slice sizes lead to ValueError: Too large swap space. errors, because vLLM tries to allocate more swap space than the total available CPU memory. I've worked around this by setting swap_space=0 in the vLLM EngineArgs (see the sketch after this list), but I'm worried this slows down model loading.
  5. vLLM does not yet support running across multiple multi-host TPU slices (i.e. specifying pipeline-parallelism > 1).
  6. The vLLM TPU backend does not yet support loading quantized models.
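
To make points 2 and 3 concrete, here is a rough back-of-envelope estimate; the numbers are approximations for illustration, not measurements from the runs above.

```python
# Rough scale estimate for Llama-3.1-405B weights in bfloat16. These are
# approximations to illustrate why large slices are needed, not measured values.
params = 405e9
bytes_per_param = 2                       # bfloat16
weight_bytes = params * bytes_per_param   # ~810 GB of weights

hbm_per_chip_gib = {"v5e": 16, "v4": 32}
for gen, hbm_gib in hbm_per_chip_gib.items():
    # Minimum chips just to hold the sharded weights, ignoring KV cache,
    # activations, and profiling overhead (which push the real requirement
    # much higher).
    min_chips = weight_bytes / (hbm_gib * 2**30)
    print(f"{gen}: >= {min_chips:.0f} chips for the weights alone")

# Disk: with the current loader each worker downloads the full checkpoint
# before resharding, so total download volume scales with the worker count.
workers = 16  # e.g. a 16-host slice
print(f"total download: ~{workers * weight_bytes / 1e12:.1f} TB across workers")
```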

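For point 4, the workaround looks roughly like the sketch below; treat it as illustrative, since the other arguments serve_tpu.py actually passes may differ.

```python
# Sketch of the swap_space workaround from point 4. The model name and
# parallelism value are placeholders, not the exact arguments used above.
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=128,
    # swap_space is the CPU swap allocation (GiB) per device, 4 by default.
    # On large multi-host slices the aggregate request can exceed available
    # host RAM, raising "ValueError: Too large swap space." Setting it to 0
    # avoids the error at the cost of disabling CPU swap for preempted
    # sequences.
    swap_space=0,
)
```
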
If the user has sufficient quota for TPU chips and SSD in their region, a v4 4x4x8 or v5e 8x16 slice is large enough to run multi-host inference with Llama-3.1-405B. However, I'm wondering whether I'm missing anything obvious (given the current state of TPU support in vLLM) that would let us a) load the model faster and b) use less disk space when initializing the model.

cc: @richardsliu @andrewsykim

ryanaoleary avatar Oct 02 '24 05:10 ryanaoleary