fairseq2
[LayerSkip] Self-Speculative Decoding
Describe the solution you would like: Implement self-speculative decoding as described in this paper, where the earlier layers of the model act as the draft stage and the remaining layers act as the verification stage.
Describe the alternatives you have considered: There are several options for implementing this:
- Implement regular Speculative Decoding, where the draft stage is a separate model; Self-Speculative Decoding could then be implemented by providing a subset of the main model's layers as the draft model (e.g., this implementation)
- If we use this setup, we can add flags to inform the earlier layers whether they are running in the draft stage or the verification stage
- Directly implement Self-Speculative Decoding as done here
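For illustration, the draft/verify split described above can be sketched as a greedy decoding loop. This is a minimal, model-agnostic sketch, not fairseq2 API: `full_model` and `draft_model` are hypothetical callables that return the greedy next token given a token history, standing in for a full forward pass and an early-exit forward pass over the first layers. A real implementation would verify all drafted positions in a single batched forward pass and reuse the draft layers' KV cache rather than re-running the model per position.

```python
def self_speculative_decode(full_model, draft_model, prompt, max_new_tokens, k=4):
    """Greedy self-speculative decoding sketch.

    draft_model: cheap early-exit predictor (first layers of the model).
    full_model: full-depth predictor used for verification.
    Both are hypothetical callables: token list -> next token.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft stage: propose k tokens cheaply with the early layers.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # Verification stage: check each drafted token against the full
        # model's greedy choice; accept the longest matching prefix.
        accepted = []
        for i in range(k):
            target = full_model(tokens + draft[:i])
            accepted.append(target)
            if target != draft[i]:
                # Mismatch: keep the full model's token and redraft.
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


# Toy demo (not a real LM): the full model predicts (last + 1) % 10; the
# "draft" model agrees only when the last token is odd, so some drafts
# are rejected. The output still matches plain greedy decoding with the
# full model, which is the lossless guarantee of speculative decoding.
full = lambda ts: (ts[-1] + 1) % 10
draft = lambda ts: (ts[-1] + 1) % 10 if ts[-1] % 2 else (ts[-1] + 2) % 10
out = self_speculative_decode(full, draft, [0], max_new_tokens=6)
```

The key property (shared by both alternatives above) is that every emitted token is the full model's greedy choice, so the speedup comes purely from accepting cheap drafts, not from changing the output distribution.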
Additional Context:
- Speculative Decoding was first proposed in Fast Inference from Transformers via Speculative Decoding
- Another variant of self-speculative decoding where the draft stage is a subset of the layers of the main model is presented in Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding