kubernetes-engine-samples

Add RayService vLLM TPU Inference script

Open ryanaoleary opened this issue 5 months ago • 4 comments

Description

This PR adds a simple inference script for a Ray multi-host TPU example that serves Meta-Llama-3-70B. Like the other scripts in the /llm/ folder, serve_tpu.py builds a Ray Serve deployment for vLLM, which can then be queried with text prompts to generate output. The script will be used in a tutorial in the GKE and Ray docs.
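For reviewers who want to try the deployment, querying it with a text prompt might look roughly like the sketch below. This is not part of serve_tpu.py. The URL (`http://localhost:8000/`, which assumes the Serve HTTP port has been forwarded locally) and the `prompt` / `max_tokens` request fields are assumptions made for illustration, so adjust them to whatever the deployment actually accepts.

```python
import json
import urllib.request

# Assumption: the Serve HTTP proxy is reachable locally (e.g. via port-forwarding).
SERVE_URL = "http://localhost:8000/"


def build_request(prompt: str, max_tokens: int = 64,
                  url: str = SERVE_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying a text prompt.

    The "prompt" and "max_tokens" field names are hypothetical; they are
    not taken from serve_tpu.py.
    """
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(prompt: str, timeout: float = 60.0) -> dict:
    """Send the prompt to the endpoint and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `query("What is a TPU?")` against a running endpoint would return the decoded JSON response. Keeping request construction separate from the network call makes the payload easy to inspect without a live cluster.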

Tasks

  • [x] The contributing guide has been read and followed.
  • [x] The samples added / modified have been fully tested.
  • [x] Workflow files have been added / modified, if applicable.
  • [x] Region tags have been properly added, if new samples.
  • [x] All dependencies are set to up-to-date versions, as applicable.
  • [ ] Merge this pull-request for me once it is approved.

ryanaoleary • Sep 25 '24 00:09