
[Feature] Speculative Decoding

Open josephrocca opened this issue 1 year ago • 17 comments

Motivation

Speculative decoding can speed up generation by more than 2x. That degree of speedup is an important feature for a production-grade LM deployment library, and the methods now seem mature enough that they are making their way into frameworks like TGI and vLLM, so it might be a good time for LMDeploy to consider adding support for a popular/established speculative decoding method.

Related resources

  • TGI (supports Medusa and MLPSpeculator as of writing):
    • https://huggingface.co/docs/text-generation-inference/basic_tutorials/train_medusa
    • https://github.com/huggingface/text-generation-inference/pull/1865
  • vLLM (groundwork for several speculation methods in progress as of writing):
    • https://github.com/vllm-project/vllm/pull/2188
    • https://github.com/vllm-project/vllm/pull/4947
    • https://github.com/vllm-project/vllm/pull/4978
    • https://github.com/vllm-project/vllm/pull/5101
    • https://github.com/vllm-project/vllm/pull/5131
  • MLC-LLM (supports only EAGLE as of writing):
    • https://github.com/mlc-ai/mlc-llm/pull/2080
    • https://github.com/mlc-ai/mlc-llm/pull/2197
    • https://github.com/mlc-ai/mlc-llm/pull/2256
    • https://github.com/mlc-ai/mlc-llm/pull/2266
    • https://github.com/mlc-ai/mlc-llm/pull/2294
    • https://github.com/mlc-ai/mlc-llm/pull/2336

Below is a copy-paste of benchmark results from a neat project called Spec-Bench. The ranking when running 33B models is similar. Please see the linked repo for the latest data.

  • Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
  • Testing environment: Pytorch 2.0.1, under CUDA 11.8
  • Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1
| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |

Note that MLPSpeculator is not included in the benchmark since it is newer. Another new method that isn't included in Spec-Bench as of writing:

  • https://github.com/apple/ml-recurrent-drafter

josephrocca · Jun 07 '24

In fact, we have already implemented the Medusa TreeMask version in LMDeploy. When batch=1, the acceleration ratio and RPS improvement relative to the main branch are consistent with those in the blog.

And when the batch size increases, the overhead of Medusa prefill is greater than the benefit of generating multiple tokens at each iteration. We are currently working on solving this problem. Please stay tuned.
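For context, the TreeMask trick is to verify a whole tree of draft candidates in a single target-model forward pass by masking attention so that each draft token only sees its own ancestors. A minimal sketch of how such a mask can be built (illustration only, not LMDeploy's actual code):

```python
import torch

def build_tree_attention_mask(parents):
    """Additive attention mask for a tree of draft tokens.

    parents[i] is the index of draft token i's parent inside the tree,
    or -1 if it hangs directly off the last committed token. Each draft
    token may attend only to itself and its ancestors; the committed
    prefix is handled by the ordinary causal mask.
    """
    n = len(parents)
    allowed = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:              # walk up the tree, marking ancestors
            allowed[i, j] = True
            j = parents[j]
    mask = torch.full((n, n), float("-inf"))
    mask[allowed] = 0.0             # 0 = attend, -inf = masked out
    return mask

# A tiny 5-node draft tree: node 0 off the committed token,
# two candidates (1, 2) under it, and one child under each.
print(build_tree_attention_mask([-1, 0, 0, 1, 2]))
```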

zhyncs · Jun 08 '24

We also plan to support EAGLE and open-source it in the future.

zhyncs · Jun 08 '24

@zhyncs I implemented EAGLE in vLLM and met the same problem when the batch size increases. Here is a simple analysis (bs is the batch size, k is the proposal length, and the batch-size bottleneck of the target model is 3): [figure: spec_decode] Because the computation spent on rejected tokens wastes GPU resources, skipping speculative decoding is sometimes the best choice.

Meituan's solution introduces a novel sampling mechanism that leverages Thompson Sampling to regulate the generation process. And someone else uses a trained control module (I forget the source).

Or, similar to vLLM's current approach, we can simply skip speculative decoding when the batch size exceeds a certain threshold. It's simple and effective, and the extra gating condition leaves room for future enhancements.
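To make that concrete, the gating can be as small as a function like the following (the names and thresholds are made up for illustration; this is not an actual vLLM or LMDeploy option):

```python
def speculative_tokens_for_step(batch_size,
                                max_spec_tokens=4,
                                disable_above_batch=8):
    """Toy gating policy: speculate while the target model is still
    memory-bound at small batch sizes, otherwise fall back to plain
    decoding so rejected draft tokens don't burn compute the batch needs.
    Illustrative names/thresholds only."""
    return 0 if batch_size > disable_above_batch else max_spec_tokens

# per scheduling step, roughly:
#   k = speculative_tokens_for_step(len(running_requests))
#   if k == 0: run an ordinary decode step
#   else:      draft k tokens per sequence and verify them in one pass
```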

RobbieLeung · Jun 14 '24

we can simply skip speculative decoding when the batch size exceeds a certain threshold

Thank you for sharing. In fact, this is currently how we do it internally as well, but this approach is still a bit rough. If we want speculative decoding to be enabled by default without users having to think about it, we also need to dynamically adjust the threshold based on the actual workload, which introduces a certain level of complexity.

In actual usage, the acceptance rate of EAGLE is slightly higher than that of Medusa.

The Thompson Sampling control mechanism is currently not implemented in an actual production environment.
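For context on the Thompson Sampling idea, the gist is to treat "speculate or not this step" as a bandit problem: keep a posterior over how often speculation actually pays off under the current load, sample from it each step, and speculate only when the sample is favorable. A toy Beta-Bernoulli version, purely for illustration (not the implementation from the paper):

```python
import random

class SpeculationBandit:
    """Toy Thompson-sampling switch for speculative decoding.

    Keeps a Beta(successes + 1, failures + 1) posterior over the probability
    that a speculative step beats a plain decode step under the current load,
    and samples from it each step; the randomness of the sample provides the
    exploration. Purely illustrative, not a real engine component.
    """

    def __init__(self):
        self.successes = 0   # speculative steps that were net wins
        self.failures = 0    # speculative steps that wasted compute

    def should_speculate(self) -> bool:
        p = random.betavariate(self.successes + 1, self.failures + 1)
        return p > 0.5

    def update(self, accepted_tokens: int, draft_overhead_tokens: int) -> None:
        # Crude proxy for "did speculation help": did the accepted tokens
        # outweigh the extra drafting/verification work this step?
        if accepted_tokens > draft_overhead_tokens:
            self.successes += 1
        else:
            self.failures += 1
```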

zhyncs · Jun 14 '24

We also plan to support EAGLE and open-source it in the future.

Can you share the schedule? Or share the development branch so we can work on it together. Thanks!

coolhok · Jun 14 '24

In fact, we have already implemented the Medusa TreeMask version in LMDeploy. When batch=1, the acceleration ratio and RPS improvement relative to the main branch are consistent with those in the blog.

And when the batch size increases, the overhead of Medusa prefill is greater than the benefit of generating multiple tokens at each iteration. We are currently working on solving this problem. Please stay tuned.

@zhyncs Hey, could you let me know how things are going right now? Maybe there's something I can do to lend a hand? Appreciate it.

snippetzero · Jul 10 '24

I will split the internal implementation of the TreeMask version into multiple PRs and then submit them.

zhyncs · Jul 10 '24

I will split the internal implementation of the TreeMask version into multiple PRs and then submit them.

Thank you. Could you share the methods used to address the performance degradation as the batch size increases?

snippetzero · Jul 10 '24

The overall design and detailed implementation were discussed with @lzhangzz before. There was an improvement at small batch sizes, but it didn't work well at large batch sizes. As far as I know, the performance achieved on vLLM is also similar.

zhyncs · Jul 10 '24

EAGLE has a higher computational load than Medusa, but it also has a higher acceptance rate, so it performs better than Medusa at large batch sizes. However, this is just a temporary solution. The approach works by trading more computation for reduced latency, and it only pays off because computational resources are not fully utilized at small batch sizes.

zhyncs · Jul 10 '24

EAGLE has a higher computational load than Medusa, but it also has a higher acceptance rate, so it performs better than Medusa at large batch sizes. However, this is just a temporary solution. The approach works by trading more computation for reduced latency, and it only pays off because computational resources are not fully utilized at small batch sizes.

How is the attention kernel chosen during the verification stage? As mentioned in the FlashInfer blog, the computational intensity of the append/verification stage is between that of decode and prefill, so it doesn't seem optimal to use either the decode or the prefill kernel from the LMDeploy engine directly. It should also be noted that when Q is relatively long, using the current prefill kernel for verification might not be the optimal approach.
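For what it's worth, a crude per-head model makes the point: the FLOP count of the attention matmuls scales with q_len × kv_len while the K/V bytes read scale only with kv_len, so the arithmetic intensity grows roughly linearly with the number of tokens being verified (a toy estimate, not a description of TurboMind's kernels):

```python
def attn_flops_per_kv_byte(q_len, kv_len, head_dim=128, kv_bytes=2):
    """Very rough per-head model: FLOPs of the QK^T and PV matmuls divided
    by the bytes of K/V read from memory (fp16). Ignores softmax, the causal
    mask, and weight/activation traffic; for intuition only."""
    flops = 2 * 2 * q_len * kv_len * head_dim      # two matmuls, 2 FLOPs/MAC
    bytes_read = 2 * kv_len * head_dim * kv_bytes  # K and V
    return flops / bytes_read                      # simplifies to q_len

for q_len in (1, 8, 64, 4096):   # decode, tree verification, long verify, prefill-ish
    print(q_len, attn_flops_per_kv_byte(q_len, kv_len=4096))
```

In this toy model the intensity works out to roughly q_len FLOPs per KV byte, which is why plain decode (q_len = 1) is deeply memory-bound while verifying even a handful of draft tokens already sits noticeably closer to the compute-bound prefill regime.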

snippetzero · Jul 11 '24

How is the attention kernel chosen during the verification stage?

@snippetzero The current implementation uses the prefill kernel in TurboMind. cc @lzhangzz

zhyncs · Jul 11 '24

In fact, we have already implemented the Medusa TreeMask version in LMDeploy. When batch=1, the acceleration ratio and RPS improvement relative to the main branch are consistent with those in the blog.

And when the batch size increases, the overhead of Medusa prefill is greater than the benefit of generating multiple tokens at each iteration. We are currently working on solving this problem. Please stay tuned.

Hi, it seems that I cannot find any code related to speculative decoding in LMDeploy. Has it not been pushed to the repository yet? If it has, could you provide me with a commit ID or some keywords to search for?

GxjGit · Aug 08 '24

Has it not been pushed to the repository yet?

I don't think it's available yet, and I'm not sure if this can be prioritised right now. Maybe @lvhan028 can comment? It certainly seems very exciting based on vLLM's findings:

  • https://github.com/vllm-project/vllm/issues/4630

In the Anyscale fork we saw a 50% speedup on bs=8 with a 68m-sized draft model on TP1/70B target model on TP8 and a 7B draft model on TP(1|8)/70B target model on TP8. This was with the optimizations listed above as "P0".

and also based on together.ai's findings:

  • https://www.together.ai/blog/speculative-decoding-for-high-throughput-long-context-inference

Conventional wisdom (e.g., Chen et al., 2023; Li et al., 2024; Liu et al., 2024) is that in the high-throughput regime (i.e., large batch sizes), speculative decoding—which leverages underutilized GPU compute during memory-bound decoding—does not make sense, because decoding will be compute-bound and the GPUs will thus be fully utilized. Surprisingly, we show analytically and empirically that for large batch sizes, if the input sequences are long enough, decoding once again becomes memory-bound due to the large size of the KV cache. Building on this key observation, we demonstrate that speculative decoding can improve throughput and latency by up to 2x on 8 A100s in this large-batch, long-context setting.
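That claim is easy to sanity-check with rough arithmetic: once sequences get long, the bytes moved per decode step are dominated by the KV cache rather than the weights, so even large batches stay memory-bound and leave compute headroom for speculation. A crude estimate with illustrative, roughly Llama-2-70B-with-GQA-shaped parameters (not measured numbers):

```python
def hbm_bytes_per_decode_step(batch_size, seq_len,
                              n_layers=80, n_kv_heads=8, head_dim=128,
                              weight_bytes=140e9, kv_elem_bytes=2):
    """Rough bytes read per decoding step: the weights once (shared by the
    whole batch) plus K and V for every cached token of every request.
    Parameters are only roughly Llama-2-70B-with-GQA shaped, fp16."""
    kv_bytes = (batch_size * seq_len * n_layers *
                n_kv_heads * head_dim * 2 * kv_elem_bytes)   # K and V
    return weight_bytes, kv_bytes

for seq_len in (1_000, 32_000):
    w, kv = hbm_bytes_per_decode_step(batch_size=64, seq_len=seq_len)
    print(f"seq_len={seq_len:>6}: weights ~{w/1e9:.0f} GB, KV cache ~{kv/1e9:.0f} GB per step")
```

With these illustrative numbers the KV cache already dwarfs the weights at 32k context, which matches the blog's argument that long-context decoding becomes memory-bound again even at large batch sizes.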

josephrocca · Sep 11 '24

No, it hasn't

lvhan028 · Sep 11 '24

I know there are a few people monitoring this, so I just want to make sure that lvhan028's response is not interpreted as lack of interest in this feature. The LMDeploy team is interested in implementing speculative decoding! Do not lose faith.

https://github.com/InternLM/lmdeploy/issues/2470#issuecomment-2362591828

As for speculative decoding, it is in our scope. Stay tuned.

Very exciting! Especially if compatible with the other key features (AWQ, prefix cache, quantized KV cache). I will be patient.

josephrocca · Sep 20 '24

Some more/newer references relevant to this feature request:

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

  • https://arxiv.org/abs/2406.16858
  • https://github.com/SafeAILab/EAGLE (main branch of repo is EAGLE-2)
  • "EAGLE-2 uses the confidence scores from the draft model to approximate acceptance rates, dynamically adjusting the draft tree structure, which further enhances performance."
  • Speedup ratios of 3.05x to 4.26x according to authors.
  • 20%-40% faster than EAGLE-1 according to authors.
  • "EAGLE-3 is on the way. Stay tuned!" https://x.com/hongyangzh/status/1837394501070962887

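To make the "dynamic draft tree" idea a bit more concrete, here is a toy sketch of confidence-guided tree expansion, i.e. growing the tree best-first by cumulative draft probability instead of using a fixed tree shape (a rough stand-in only; see the EAGLE repo for the real algorithm):

```python
import heapq

def expand_draft_tree(first_candidates, propose_children, budget=16, top_k=2):
    """Toy confidence-guided draft-tree expansion (stand-in for EAGLE-2's
    dynamic tree, not the real algorithm).

    first_candidates: list of (prob, token) for the first draft position.
    propose_children(path): returns (prob, token) candidates for the next
        position given a draft path; stands in for a draft-model call.
    A node's score is the product of draft probabilities along its path,
    used as a cheap proxy for how likely the target model is to accept it.
    """
    heap, kept = [], []
    for prob, tok in first_candidates[:top_k]:
        heapq.heappush(heap, (-prob, (tok,)))           # max-heap via negation
    while heap and len(kept) < budget:
        neg_score, path = heapq.heappop(heap)
        kept.append(path)                               # this node gets verified
        for prob, tok in propose_children(path)[:top_k]:
            heapq.heappush(heap, (neg_score * prob, path + (tok,)))
    return kept                                         # paths to verify with a tree mask
```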

DISCO: DynamIc SpeCulation lookahead Optimization

  • https://arxiv.org/abs/2405.04304
  • https://huggingface.co/blog/dynamic_speculation_lookahead
  • Has been made the default operational mode for assisted generation starting from HF Transformers release 4.45.0
  • Up to 2.7x faster, depending on the task, according to authors.

Note: I haven't read into the details, but I suspect these reported speedup ratios were measured under very ideal circumstances (compute density, model size relative to the GPU hardware, and other factors). Even if it were to only give a 1.2x speedup (for example) in real-world circumstances, though, that would still be very useful!

josephrocca · Oct 10 '24

Some more/newer references relevant to this feature request:

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
DISCO: DynamIc SpeCulation lookahead Optimization

Awesome

Alwin4Zhang · Oct 12 '24