kubernetes-engine-samples
Add RayService vLLM TPU Inference script
Description
This PR adds a simple inference script for a Ray multi-host TPU example that serves Meta-Llama-3-70B. Like the other scripts in the /llm/ folder, serve_tpu.py builds a Ray Serve deployment for vLLM, which can then be queried with text prompts to generate output. The script will be used in a tutorial in the GKE and Ray docs.
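For reviewers who want to try the deployment, a client along these lines could send a text prompt once the service is running. This is only a sketch under stated assumptions: the endpoint URL (`http://localhost:8000/`, for example via a port-forward), the JSON field names (`prompt`, `max_tokens`, `temperature`), and the helper names `build_request` and `query` are illustrative and are not taken from serve_tpu.py, so adjust them to match the actual script.

```python
import json
import urllib.request

# Assumption: the Serve endpoint is reachable locally (for example through a
# port-forward). This URL is illustrative and not taken from serve_tpu.py.
SERVE_URL = "http://localhost:8000/"


def build_request(prompt: str, max_tokens: int = 128,
                  temperature: float = 0.7) -> urllib.request.Request:
    """Build a JSON POST request carrying a text prompt.

    The field names below are assumptions about the request schema.
    """
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        SERVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the raw response body as text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running deployment at SERVE_URL.
    print(query("What are the top 5 most popular programming languages?"))
```

Keeping the request construction separate from the network call makes it possible to inspect the payload without a live cluster.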
Tasks
- [x] The contributing guide has been read and followed.
- [x] The samples added / modified have been fully tested.
- [x] Workflow files have been added / modified, if applicable.
- [x] Region tags have been properly added, if new samples.
- [x] All dependencies are set to up-to-date versions, as applicable.
- [ ] Merge this pull-request for me once it is approved.