fairseq2
[LayerSkip] Self-Speculative Decoding
Describe the solution you would like: Implement self-speculative decoding as described in this paper, where the earlier layers of the model act as the draft stage and the remaining layers act as the verification stage.
Describe the alternatives you have considered: There are several options for implementing this:
- Implement regular Speculative Decoding, where the draft stage is a separate model; Self-Speculative Decoding could then be implemented by providing a subset of the main model's layers as the draft model (e.g., this implementation)
- If we use this setup, we can add flags to inform the earlier layers whether they are running in the draft stage or the verification stage
- Directly implement Self-Speculative Decoding as done here
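For illustration, the draft/verify split described above can be sketched as a greedy decoding loop. This is a minimal, model-agnostic sketch, not fairseq2 API: `full_model` and `draft_model` are hypothetical callables that return the greedy next token given a token history, standing in for a full forward pass and an early-exit forward pass over the first layers. A real implementation would verify all drafted positions in a single batched forward pass and reuse the draft layers' KV cache rather than re-running the model per position.

```python
def self_speculative_decode(full_model, draft_model, prompt, max_new_tokens, k=4):
    """Greedy self-speculative decoding sketch.

    draft_model: cheap early-exit predictor (first layers of the model).
    full_model: full-depth predictor used for verification.
    Both are hypothetical callables: token list -> next token.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft stage: propose k tokens cheaply with the early layers.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # Verification stage: check each drafted token against the full
        # model's greedy choice; accept the longest matching prefix.
        accepted = []
        for i in range(k):
            target = full_model(tokens + draft[:i])
            accepted.append(target)
            if target != draft[i]:
                # Mismatch: keep the full model's token and redraft.
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


# Toy demo (not a real LM): the full model predicts (last + 1) % 10; the
# "draft" model agrees only when the last token is odd, so some drafts
# are rejected. The output still matches plain greedy decoding with the
# full model, which is the lossless guarantee of speculative decoding.
full = lambda ts: (ts[-1] + 1) % 10
draft = lambda ts: (ts[-1] + 1) % 10 if ts[-1] % 2 else (ts[-1] + 2) % 10
out = self_speculative_decode(full, draft, [0], max_new_tokens=6)
```

The key property (shared by both alternatives above) is that every emitted token is the full model's greedy choice, so the speedup comes purely from accepting cheap drafts, not from changing the output distribution.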
Additional Context:
- Speculative Decoding was first proposed in Fast Inference from Transformers via Speculative Decoding
- Another variant of self-speculative decoding where the draft stage is a subset of the layers of the main model is presented in Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding