[Hardware][TPU] Implement tensor parallelism with Ray
This PR implements a Ray TPU executor to support distributed inference on TPUs.
Work in progress. Three major issues:
- A correctness bug that appears after the model generates a few tokens.
- Code duplication between the Ray GPU executor and the Ray TPU executor.
- Performance.
NOTE: This PR was implemented before #5408 and needs to be rebased to reflect those changes.
~~For this PR, I will merge it after getting reviews. :)~~
The changes outside the TPU backend were reviewed in #6812 and #6813.