CTranslate2
Support Speculative Decoding
This could be used for LLMs and, hopefully, for encoder-decoder models as well, for example by pairing a smaller NLLB model with a bigger NLLB model.
This looks to be a duplicate of #1234
It's the same idea, but I'm not sure it refers to the same implementation. There is also "Speculative sampling", which seems to refer to yet another implementation/algorithm of this concept.
How hard would it be to implement a really naive version of this with CTranslate2? I would like to pick this up if possible.
Implementing this feature in its most basic form may already be possible with the existing Generator API. You could use generate_batch with a small model, and then use forward_batch with a big model to validate the output. The limitation of this approach is that when the big model does not agree, you have to restart the generation from scratch rather than resume at the first mismatched position.
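For illustration, here is a minimal sketch of the greedy draft-then-verify loop described above. It is written against two hypothetical callables rather than the real API: draft_fn would wrap generate_batch on the small model, and predict_fn would wrap forward_batch on the big model followed by an argmax over the logits at each position. The names speculative_generate, count_accepted, draft_fn, predict_fn and the toy models are all assumptions made for this example, not part of CTranslate2. Each round re-feeds the whole accepted prefix to both models, so nothing is reused between rounds.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins for the two models (not CTranslate2 APIs):
#   draft_fn(prefix, k) -> up to k tokens proposed by the small model
#   predict_fn(tokens)  -> for each position i, the big model's greedy
#                          next token given tokens[:i + 1]
DraftFn = Callable[[Sequence[str], int], List[str]]
PredictFn = Callable[[Sequence[str]], List[str]]

EOS = "</s>"


def count_accepted(prefix_len: int, tokens: Sequence[str],
                   predictions: Sequence[str]) -> int:
    """Count the draft tokens after the prefix that the big model agrees with."""
    accepted = 0
    for pos in range(prefix_len, len(tokens)):
        # predictions[pos - 1] is the big model's choice for position pos.
        if predictions[pos - 1] != tokens[pos]:
            break
        accepted += 1
    return accepted


def speculative_generate(prompt: Sequence[str], draft_fn: DraftFn,
                         predict_fn: PredictFn, max_new_tokens: int,
                         k: int = 4, eos: str = EOS) -> List[str]:
    """Greedy draft-then-verify loop; the prompt must be non-empty (e.g. a BOS token)."""
    tokens = list(prompt)
    limit = len(tokens) + max_new_tokens
    while len(tokens) < limit:
        # 1. The small model proposes up to k tokens.
        draft = draft_fn(tokens, min(k, limit - len(tokens)))
        candidate = tokens + list(draft)
        # 2. The big model scores the whole candidate in one forward pass.
        predictions = predict_fn(candidate)
        accepted = count_accepted(len(tokens), candidate, predictions)
        new_tokens = candidate[len(tokens):len(tokens) + accepted]
        # 3. Take the big model's own token at the first mismatch (or right
        #    after a fully accepted draft), so every round makes progress.
        new_tokens.append(predictions[len(tokens) + accepted - 1])
        for tok in new_tokens:
            tokens.append(tok)
            if tok == eos or len(tokens) >= limit:
                return tokens
    return tokens


# Toy models for demonstration only: the "big" model emits t1..t5 then EOS,
# and the "small" model makes one deliberate mistake at the fourth position.
def toy_next(prefix: Sequence[str]) -> str:
    return EOS if len(prefix) >= 6 else f"t{len(prefix)}"


def toy_predict(tokens: Sequence[str]) -> List[str]:
    return [toy_next(tokens[:i + 1]) for i in range(len(tokens))]


def toy_draft(prefix: Sequence[str], k: int) -> List[str]:
    out: List[str] = []
    while len(out) < k:
        current = list(prefix) + out
        tok = "x" if len(current) == 3 else toy_next(current)
        out.append(tok)
        if tok == EOS:
            break
    return out


result = speculative_generate(["<s>"], toy_draft, toy_predict, max_new_tokens=10)
print(result)
```

With the toy models, the first round accepts two draft tokens and replaces the wrong third one with the big model's token, and the second round accepts the rest, so the output matches what plain greedy decoding with the big model alone would give. This only covers greedy decoding; "speculative sampling" with random sampling needs a different acceptance rule.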