kubernetes-engine-samples

Add RayService vLLM TPU Inference script

Open ryanaoleary opened this issue 1 year ago • 4 comments

Description

This PR adds a simple inference script for a Ray multi-host TPU example serving Meta-Llama-3-70B. Like the other scripts in the /llm/ folder, serve_tpu.py builds a Ray Serve deployment for vLLM, which can then be queried with text prompts to generate output. The script will be used as part of a tutorial in the GKE and Ray docs.
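
A minimal sketch of the pattern the script follows is below: a Ray Serve deployment wrapping a vLLM engine behind a text-generation endpoint. The class name, endpoint path, and argument values are illustrative assumptions, not the actual serve_tpu.py.

```python
# Illustrative sketch only -- not the actual serve_tpu.py. The class name,
# endpoint path, and argument values are assumptions for the example.
from fastapi import FastAPI
from ray import serve
from vllm import LLM, SamplingParams

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self, model: str, tensor_parallel_size: int):
        # tensor_parallel_size > 1 shards the weights across the TPU chips
        # in the slice.
        self.llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)

    @app.post("/generate")
    def generate(self, prompt: str) -> str:
        outputs = self.llm.generate(
            [prompt], SamplingParams(temperature=0.7, max_tokens=256)
        )
        return outputs[0].outputs[0].text


# Illustrative binding; the real script reads the model name and parallelism
# settings from environment variables set in the RayService manifest.
deployment = VLLMDeployment.bind(
    model="meta-llama/Meta-Llama-3-70B", tensor_parallel_size=8
)
```

Once the RayService is running, the endpoint can be queried with plain HTTP POSTs carrying a text prompt.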

Tasks

  • [x] The contributing guide has been read and followed.
  • [x] The samples added / modified have been fully tested.
  • [x] Workflow files have been added / modified, if applicable.
  • [x] Region tags have been properly added, if new samples.
  • [x] All dependencies are set to up-to-date versions, as applicable.
  • [ ] Merge this pull-request for me once it is approved.

ryanaoleary avatar Sep 25 '24 00:09 ryanaoleary

Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

andrewsykim avatar Sep 25 '24 15:09 andrewsykim

Here is the summary of changes.

You are about to add 4 region tags.

This comment is generated by snippet-bot. If you find problems with this result, please file an issue at: https://github.com/googleapis/repo-automation-bots/issues. To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • [ ] Refresh this comment

snippet-bot[bot] avatar Sep 27 '24 09:09 snippet-bot[bot]

> Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?

Yeah, that sounds good. I'm still testing the 405B RayService, but I added the 8B and 70B manifests in fe6440c; we can then use envsubst to substitute the image variable.

ryanaoleary avatar Sep 27 '24 10:09 ryanaoleary

I've tried running Llama-3.1-405B with TPU slice sizes up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

  1. As slice sizes grow, the time vLLM needs for initialization and memory profiling increases dramatically.
  2. Attempting to run inference with smaller topologies than the aforementioned slice sizes leads to errors like RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 20.44G of 15.75G hbm. Exceeded hbm capacity by 4.70G., since v5e and v4 chips only have 16 GiB and 32 GiB of HBM respectively. The relatively small per-chip HBM capacity (compared to GPUs) means we need much larger slice sizes to fit the sharded weights (see the back-of-envelope estimate after this list).
  3. Even when loading a model that will be sharded (i.e. with tensor-parallelism > 1), vLLM still downloads the entire checkpoint on each worker and only afterwards writes that worker's shard to new files. This means that larger slice sizes require a very large amount of total disk space when loading large models.
  4. Larger multi-host slice sizes lead to ValueError: Too large swap space. errors, because vLLM tries to allocate more swap space than the total available CPU memory. I've worked around this by setting swap_space=0 in the vLLM EngineArgs (see the sketch after this list), but I'm worried this slows down model loading.
  5. vLLM does not yet support running across multiple multi-host TPU slices (i.e. specifying pipeline-parallelism > 1).
  6. The vLLM TPU backend does not yet support loading quantized models.
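
To make points 2 and 3 concrete, here is a rough back-of-envelope estimate; the numbers are approximations for illustration, not measurements from the runs above.

```python
# Rough scale estimate for Llama-3.1-405B weights in bfloat16. These are
# approximations to illustrate why large slices are needed, not measured values.
params = 405e9
bytes_per_param = 2                       # bfloat16
weight_bytes = params * bytes_per_param   # ~810 GB of weights

hbm_per_chip_gib = {"v5e": 16, "v4": 32}
for gen, hbm_gib in hbm_per_chip_gib.items():
    # Minimum chips just to hold the sharded weights, ignoring KV cache,
    # activations, and profiling overhead (which push the real requirement
    # much higher).
    min_chips = weight_bytes / (hbm_gib * 2**30)
    print(f"{gen}: >= {min_chips:.0f} chips for the weights alone")

# Disk: with the current loader each worker downloads the full checkpoint
# before resharding, so total download volume scales with the worker count.
workers = 16  # e.g. a 16-host slice
print(f"total download: ~{workers * weight_bytes / 1e12:.1f} TB across workers")
```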

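For point 4, the workaround looks roughly like the sketch below; treat it as illustrative, since the other arguments serve_tpu.py actually passes may differ.

```python
# Sketch of the swap_space workaround from point 4. The model name and
# parallelism value are placeholders, not the exact arguments used above.
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=128,
    # swap_space is the CPU swap allocation (GiB) per device, 4 by default.
    # On large multi-host slices the aggregate request can exceed available
    # host RAM, raising "ValueError: Too large swap space." Setting it to 0
    # avoids the error at the cost of disabling CPU swap for preempted
    # sequences.
    swap_space=0,
)
```
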
If the user has sufficient quota for TPU chips and SSD in their region, a v4 4x4x8 or v5e 8x16 slice is large enough to run multi-host inference with Llama-3.1-405B. However, I'm wondering whether I'm missing anything obvious (given the current state of TPU support in vLLM) that would let us a) load the model faster and b) use less disk space when initializing the model.

cc: @richardsliu @andrewsykim

ryanaoleary avatar Oct 02 '24 05:10 ryanaoleary