
[feature suggestion] self speculative decoding

Open NewBornRustacean opened this issue 9 months ago • 7 comments

Good morning (or afternoon/evening)!

Self-speculative decoding is one of the techniques for accelerating LLM inference. Would it be possible to implement this feature in Luminal? If it aligns with Luminal's philosophy, I believe this kind of work could contribute significantly to inference speed. Even though it's not on the v0.3 roadmap, I'd like to start working on this gradually if that's alright.

Summary of abstract:

This paper introduces self-speculative decoding, a novel inference scheme designed to accelerate Large Language Models (LLMs) without relying on auxiliary models. It operates in two stages: drafting, which quickly generates draft tokens by selectively skipping intermediate layers, and verification, which validates the draft output using the original LLM in a single forward pass. The approach maintains output quality identical to that of unaltered LLMs, without requiring additional neural network training or extra memory, offering a plug-and-play and cost-effective solution for inference acceleration, with benchmarks showing speedups of up to 1.73× on LLaMA-2 and its fine-tuned models.
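To make the draft-then-verify loop concrete, here is a minimal Rust sketch of the control flow described above. It is framework-agnostic: the `Model` trait and both method names are hypothetical stand-ins, not Luminal's actual API, and a real integration would use Luminal's graph machinery and batch the verification into one forward pass.

```rust
/// Hypothetical interface standing in for an LLM; not part of Luminal's API.
trait Model {
    /// Greedy next-token prediction using the full network.
    fn full_forward(&self, tokens: &[u32]) -> u32;
    /// Greedy next-token prediction with some intermediate layers skipped
    /// (the cheap "drafting" pass from the paper).
    fn draft_forward(&self, tokens: &[u32]) -> u32;
}

/// Self-speculative decoding: draft `k` tokens with the skipped-layer pass,
/// then verify them with the full model, keeping the longest matching prefix
/// plus the full model's correction on the first mismatch. The output is
/// identical to plain greedy decoding with the full model.
fn self_speculative_decode<M: Model>(
    model: &M,
    prompt: &[u32],
    k: usize,
    max_new_tokens: usize,
) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    let mut generated = 0;
    while generated < max_new_tokens {
        // Drafting stage: cheaply propose up to k tokens by skipping layers.
        let mut draft = Vec::with_capacity(k);
        for _ in 0..k {
            let ctx: Vec<u32> = tokens.iter().chain(draft.iter()).copied().collect();
            draft.push(model.draft_forward(&ctx));
        }
        // Verification stage: the full model checks each draft position.
        // (A real implementation verifies all k positions in one batched
        // forward pass, which is where the speedup comes from; the loop
        // here is only for clarity.)
        for (i, &d) in draft.iter().enumerate() {
            let ctx: Vec<u32> = tokens.iter().chain(draft[..i].iter()).copied().collect();
            let verified = model.full_forward(&ctx);
            tokens.push(verified);
            generated += 1;
            if verified != d || generated >= max_new_tokens {
                // Mismatch (or budget reached): discard the rest of the
                // draft and go back to drafting from the corrected prefix.
                break;
            }
        }
    }
    tokens
}
```

The key property is that every token appended comes from `full_forward`, so correctness never depends on the draft quality; a bad draft only costs speed, never accuracy.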

NewBornRustacean avatar May 08 '24 04:05 NewBornRustacean