[Hardware][TPU] Implement tensor parallelism with Ray

Open WoosukKwon opened this issue 1 year ago • 1 comment

This PR implements a Ray TPU executor to support distributed inference on TPU.

Work in progress. Three major issues:

  • A correctness bug that appears after the model generates a few tokens.
  • Code duplication between the Ray GPU executor and the Ray TPU executor.
  • Performance.

NOTE: This PR was implemented before #5408. It needs to be rebased to reflect those changes.

WoosukKwon avatar Jun 26 '24 20:06 WoosukKwon

~~For this PR, I will merge it after getting reviews. :)~~

The changes outside the TPU backend were reviewed in #6812 and #6813.

WoosukKwon avatar Jun 26 '24 21:06 WoosukKwon