Support Speculative Decoding

Open JOHW85 opened this issue 1 year ago • 5 comments

This could be used for LLMs, and hopefully for encoder-decoder models as well, for example by pairing a smaller NLLB model with a bigger one.

JOHW85 avatar Sep 12 '23 11:09 JOHW85

This looks to be a duplicate of #1234

wsxiaoys avatar Sep 12 '23 16:09 wsxiaoys

It's the same idea, but I'm not sure it refers to the same implementation. There is also "speculative sampling", which seems to refer to yet another implementation/algorithm of this concept.

guillaumekln avatar Sep 14 '23 08:09 guillaumekln

How hard would it be to implement a really naive version of this with CTranslate2? I would like to pick this up if possible.

epinnock avatar Sep 15 '23 03:09 epinnock

Implementing this feature in its most basic form may already be possible with the existing Generator API. You could use generate_batch with a small model, and then use forward_batch with a big model to validate the output. The limitation of this approach is that when the big model does not agree, you have to restart the generation from scratch rather than resume at the first mismatched position.

guillaumekln avatar Sep 15 '23 08:09 guillaumekln
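
The draft-then-validate loop described in the last comment could look roughly like the sketch below. It is not CTranslate2 code: `draft` and `verify` are hypothetical callables standing in for the two models, and the toy functions at the bottom exist only to make the example runnable. In a real setup, `draft` would presumably wrap generate_batch on the small model, and `verify` would take the per-position argmax of the logits returned by forward_batch on the big model; that wiring is an assumption and is not shown. The sketch also assumes greedy decoding and a non-empty prompt.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (not part of CTranslate2):
#   draft(context, n)  -> up to n tokens the small model proposes after context
#   verify(tokens)     -> for every position i, the big model's greedy next
#                         token given tokens[: i + 1]
DraftFn = Callable[[Sequence[str], int], List[str]]
VerifyFn = Callable[[Sequence[str]], List[str]]


def accepted_prefix_len(proposal: Sequence[str], target: Sequence[str]) -> int:
    """Count the leading proposed tokens that match the big model's choices."""
    n = 0
    for proposed, expected in zip(proposal, target):
        if proposed != expected:
            break
        n += 1
    return n


def naive_speculative_decode(
    prompt: Sequence[str],
    draft: DraftFn,
    verify: VerifyFn,
    max_new_tokens: int,
    draft_len: int = 4,
    eos: str = "</s>",
) -> List[str]:
    """Greedy speculative decoding without any cache reuse between rounds."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    context = list(prompt)
    generated: List[str] = []
    while len(generated) < max_new_tokens:
        budget = max_new_tokens - len(generated)
        proposal = draft(context, min(draft_len, budget))
        # One big-model pass over the full sequence; both models re-read the
        # whole context every round, which is the limitation noted above.
        preds = verify(context + proposal)
        # Big-model predictions aligned with the proposed tokens, plus one
        # extra prediction for the position right after the proposal.
        target = preds[len(context) - 1:]
        k = accepted_prefix_len(proposal, target)
        # Keep the agreed tokens, then the big model's own token at position
        # k (a correction on mismatch, a free extra token otherwise).
        for token in proposal[:k] + [target[k]]:
            if len(generated) >= max_new_tokens:
                break
            generated.append(token)
            context.append(token)
            if token == eos:
                return generated
    return generated


# Toy stand-ins so the sketch runs without any model files.
def toy_big_next(tokens: Sequence[str]) -> str:
    # The "big model" counts upward modulo 10.
    return str((int(tokens[-1]) + 1) % 10)


def toy_verify(tokens: Sequence[str]) -> List[str]:
    return [toy_big_next(tokens[: i + 1]) for i in range(len(tokens))]


def toy_draft(context: Sequence[str], n: int) -> List[str]:
    # The "small model" agrees with the big one except after the token "3".
    tokens, out = list(context), []
    for _ in range(n):
        nxt = "9" if tokens[-1] == "3" else toy_big_next(tokens)
        out.append(nxt)
        tokens.append(nxt)
    return out
```

Because every accepted token is one the big model would have picked itself, the output should match plain greedy decoding with the big model; any speed-up comes from validating several tokens in a single forward pass.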