CTranslate2 Support Speculative Decoding

Open JOHW85 opened this issue 1 year ago • 5 comments

This could be used for LLMs and, hopefully, for encoder-decoder models as well, for example by pairing a smaller NLLB model (as the draft) with a bigger NLLB model.

Sep 12 '23 11:09 JOHW85

This looks to be a duplicate of #1234

Sep 12 '23 16:09 wsxiaoys

It's the same idea, but I'm not sure it refers to the same implementation. There is also "speculative sampling", which seems to refer to yet another implementation/algorithm of this concept.

Sep 14 '23 08:09 guillaumekln

How hard would it be to implement a really naive version of this with CTranslate2? I would like to pick this up if possible.

Sep 15 '23 03:09 epinnock

Implementing this feature in its most basic form may already be possible with the existing Generator API. You could use generate_batch with a small model to produce a draft, and then use forward_batch with a big model to validate that output. The limitation of this approach is that when the big model does not agree, you have to restart the generation from scratch rather than resume at the first mismatched position.

Sep 15 '23 08:09 guillaumekln
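
The validation step described in the comment above can be sketched in plain Python. This is only an illustration under stated assumptions, not CTranslate2 code: `toy_draft_generate` is a hypothetical stand-in for calling generate_batch on the small model, and `toy_target_greedy` is a hypothetical stand-in for taking the argmax over the logits that forward_batch returns for the big model. Both toy functions simply replay fixed sentences. The helper `verify_draft` is also a name invented here; it reports how many draft tokens the big model agrees with and which token it would have produced at the first mismatch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signature: given a token sequence, return the target model's
# greedy next-token choice after every position (same length as the input).
TargetGreedyFn = Callable[[Sequence[str]], List[str]]


def verify_draft(
    prompt: Sequence[str],
    draft: Sequence[str],
    target_greedy: TargetGreedyFn,
) -> Tuple[int, Optional[str]]:
    """Check a drafted continuation with a single pass of the target model.

    Returns (number of accepted draft tokens, target's token at the first
    mismatch). The second element is None when the whole draft is accepted.
    """
    if not prompt:
        raise ValueError("prompt must contain at least one token (e.g. a BOS token)")
    tokens = list(prompt) + list(draft)
    preds = target_greedy(tokens)
    # preds[len(prompt) - 1 + i] is the target's choice for draft position i.
    offset = len(prompt) - 1
    for i, tok in enumerate(draft):
        expected = preds[offset + i]
        if expected != tok:
            return i, expected
    return len(draft), None


# --- Toy stand-ins (assumptions for illustration only) ---
TARGET_TEXT = "the cat sat on the mat".split()
DRAFT_TEXT = "the cat sat on a mat".split()


def toy_target_greedy(tokens: Sequence[str]) -> List[str]:
    # Position-only toy "big model": it always continues TARGET_TEXT.
    return [
        TARGET_TEXT[i + 1] if i + 1 < len(TARGET_TEXT) else "</s>"
        for i in range(len(tokens))
    ]


def toy_draft_generate(prompt: Sequence[str], max_new: int) -> List[str]:
    # Toy "small model": it always continues DRAFT_TEXT.
    start = len(prompt)
    return DRAFT_TEXT[start:start + max_new]


if __name__ == "__main__":
    prompt = ["the"]
    draft = toy_draft_generate(prompt, 5)
    accepted, correction = verify_draft(prompt, draft, toy_target_greedy)
    print(accepted, correction)  # 3 the
```

The sketch covers only greedy agreement checking. What to do after a mismatch (the comment notes that the naive approach has to restart generation) and the probabilistic acceptance rule used by speculative sampling are left out.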
