text-generation-inference
Add complete guide on speculative decoding
What does this PR do?
This PR adds a comprehensive guide to speculative decoding techniques for speeding up inference. It includes a minimal implementation of the core ideas from the paper *Accelerating Large Language Model Decoding with Speculative Sampling*, applied to GPT-2, along with detailed explanations of the underlying mathematics and an intuitive breakdown of why speculative decoding works.
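To give a flavor of what the guide covers, here is a sketch (not the PR's actual code) of the accept/reject rule at the heart of speculative sampling from that paper: a draft model proposes a token from its distribution `q`, which is accepted with probability `min(1, p(x)/q(x))` under the target distribution `p`, and on rejection a token is resampled from the normalized residual `max(0, p - q)`. The function and variable names below are illustrative, and toy distributions stand in for real GPT-2 logits.

```python
import numpy as np

def speculative_step(p, q, rng):
    """One accept/reject step of speculative sampling (illustrative sketch).

    p: target-model distribution over the vocabulary (1-D array, sums to 1)
    q: draft-model distribution over the vocabulary (1-D array, sums to 1)
    Returns a token id whose distribution matches the target p exactly.
    """
    x = rng.choice(len(q), p=q)               # draft model proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                              # accept the draft token
    residual = np.maximum(p - q, 0.0)         # on rejection, resample from
    residual /= residual.sum()                # the normalized max(0, p - q)
    return rng.choice(len(residual), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # toy target (large-model) distribution
q = np.array([0.2, 0.5, 0.3])   # toy draft (small-model) distribution
samples = [speculative_step(p, q, rng) for _ in range(20000)]
# Despite sampling proposals from q, the accepted output follows p.
freq = np.bincount(samples, minlength=3) / len(samples)
```

The key property, proved in the paper, is that this scheme is exact: the output token is distributed according to the target model, so speed comes only from accepting cheap draft tokens, never from changing the sampling distribution.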
This PR complements the speculation.md file in the TGI repository, offering additional practical and theoretical insight into speculative sampling.