vllm [Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model

Open cadedaniel opened this issue 1 year ago • 1 comments

Overview

Speculative decoding speeds up memory-bound LLM inference by using a fast proposal method to propose tokens, which the larger LLM then verifies in a single forward pass. Papers report 2-3x speedup for bs=1. In Anyscale's fork we see up to 2x speedup with a small draft model for bs=8 (30% for bs=16). We can improve this! See https://github.com/vllm-project/vllm/issues/4630 if you want to help.
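For readers new to the technique, here is a minimal sketch of one propose/verify step. It assumes greedy decoding and models each LLM as a plain function from a token list to the next token; `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names for illustration, not vLLM APIs. Real implementations verify all positions in one batched forward pass and usually use rejection sampling over probabilities rather than exact token matching.

```python
from typing import Callable, List

# A "model" here is just a function: token prefix -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one speculative step (between 1 and k+1)."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target scores every position. A real system does this in a
    #    single batched forward pass; this toy version loops instead.
    target_tokens = [target(prefix + proposal[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of proposed tokens the target agrees with.
    accepted: List[int] = []
    for i in range(k):
        if proposal[i] != target_tokens[i]:
            break
        accepted.append(proposal[i])

    # 4. Always emit one token from the target: the correction at the first
    #    mismatch, or a bonus token if everything was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted


# Toy models (assumptions for illustration only).
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


step_output = speculative_step(toy_target, toy_draft, prefix=[0], k=4)
```

The output of a step always matches what the target alone would have produced under greedy decoding, so the draft affects only speed, never the result.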

A key optimization for small draft models (in the 68m/160m range) is to run them with tensor-parallel degree 1, even if the target model uses tensor-parallel degree 4 or 8. In our fork, this reduces proposal time from 5ms/tok to 1.5ms/tok. This will allow a well-aligned 68m draft model to get a 2x per-user throughput improvement on a 70B target model.
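To make the effect of the proposal time concrete, here is a rough back-of-the-envelope sketch. Only the 5ms/tok and 1.5ms/tok figures come from the numbers above; the number of speculative tokens `k`, the target step time, and the average tokens emitted per step are illustrative assumptions, not measurements.

```python
def proposal_overhead_ms(k: int, draft_ms_per_tok: float) -> float:
    """Time spent drafting k tokens before the target can verify them."""
    return k * draft_ms_per_tok


def estimated_speedup(k: int, draft_ms_per_tok: float,
                      target_step_ms: float, tokens_per_step: float) -> float:
    """Speedup over plain decoding under a simple latency model.

    Plain decoding emits 1 token per target step. Speculative decoding emits
    `tokens_per_step` tokens per (draft k tokens + one target verify pass).
    """
    spec_step_ms = proposal_overhead_ms(k, draft_ms_per_tok) + target_step_ms
    return tokens_per_step * target_step_ms / spec_step_ms


# ASSUMED values for illustration: k=4, 30ms target step, 3 tokens per step.
K, TARGET_MS, TOKENS = 4, 30.0, 3.0

overhead_same_tp = proposal_overhead_ms(K, 5.0)   # draft shares the target's TP
overhead_tp1 = proposal_overhead_ms(K, 1.5)       # draft runs at TP=1
speedup_same_tp = estimated_speedup(K, 5.0, TARGET_MS, TOKENS)
speedup_tp1 = estimated_speedup(K, 1.5, TARGET_MS, TOKENS)
```

Under these assumed values the draft overhead per step drops from 20ms to 6ms, which is why the proposal latency matters so much when the draft model is tiny.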

Furthermore, a 1B/7B proposer model may be best placed on TP=2 or TP=4 while the larger model is placed on TP=8. vLLM should support these configurations so the community can use the configuration that is best for their draft model.

Design suggestions

I implemented a Worker which patches the tensor parallel group to TP1 in our fork. The code is dumped here. We should use this approach in vLLM; however, we can improve it by using @youkaichao's tensor-parallel group improvements.
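The general shape of the patching idea can be sketched as follows. This is not the fork's code and not vLLM's API: `TPGroup`, `get_tp_group`, `patched_tp_group`, and `DraftWorker` are hypothetical stand-ins, and a module-level variable plays the role of the global tensor-parallel group state. The point is only that the draft worker temporarily swaps in a single-rank group while it runs, then restores the target's group.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class TPGroup:
    """Hypothetical stand-in for a tensor-parallel process group."""
    ranks: Tuple[int, ...]

    @property
    def world_size(self) -> int:
        return len(self.ranks)


# Hypothetical global state: the target model runs with TP=4.
_current_group = TPGroup(ranks=(0, 1, 2, 3))


def get_tp_group() -> TPGroup:
    """What model layers would query to decide how to shard and reduce."""
    return _current_group


@contextmanager
def patched_tp_group(group: TPGroup) -> Iterator[None]:
    """Temporarily replace the global TP group, restoring it on exit."""
    global _current_group
    saved = _current_group
    _current_group = group
    try:
        yield
    finally:
        _current_group = saved


class DraftWorker:
    """Runs the draft model as if this rank were its own TP=1 group."""

    def __init__(self, rank: int) -> None:
        self.group = TPGroup(ranks=(rank,))

    def propose(self) -> int:
        with patched_tp_group(self.group):
            # A real worker would run the draft forward pass here; this
            # sketch just reports the TP size the draft model would see.
            return get_tp_group().world_size


draft_tp_size = DraftWorker(rank=2).propose()
target_tp_size = get_tp_group().world_size
```

The `try`/`finally` matters: if the draft forward pass raises, the target's group must still be restored, otherwise later target steps would run with the wrong group. Passing the group explicitly instead of patching a global would avoid that hazard altogether.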

May 06 '24 18:05 cadedaniel

I can work on this after a major refactor of distributed (https://github.com/vllm-project/vllm/pull/4591) has landed.

May 06 '24 18:05 youkaichao

@cadedaniel Can I contribute my code, which already implements this feature on v0.4.2? I referred to your code in #2188.

I'm aware that #4933 is going on, so I want to confirm that it's okay to do it.

Jun 05 '24 07:06 wooyeonlee0

@wooyeonlee0 pls go ahead.

Jun 05 '24 22:06 GeauxEric

yep, my policy is to review the PRs in the order that they're initially ready for review. go ahead @wooyeonlee0 .

Jun 06 '24 23:06 cadedaniel

Thanks for the answer :) I'll send a PR maybe next week.

Jun 07 '24 01:06 wooyeonlee0
