
Slower than llama2.c?

Open aboeing opened this issue 10 months ago • 2 comments

I tried running the example (stories42M/stories15M) and compared tokens/sec against the original llama2.c; this variant runs slower.

Is that to be expected?

aboeing · Mar 06 '25 02:03

Yes, that's expected. This implementation runs entirely on the CPU, and the speedup from speculative decoding comes from verifying the draft tokens in parallel, a step that benefits greatly from GPU parallelism. Without that, the extra work of running the draft model and then re-checking its tokens with the target model is pure overhead, so token generation ends up slower than the plain llama2.c loop. I may work on adding GPU support in the future to improve performance. In the meantime, consider using this code as supplementary material to better understand the SD algorithm.
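To make the "draft then verify" idea concrete, here is a toy, self-contained sketch of greedy speculative decoding. It is not the repository's code: `draft_next` and `target_next` are made-up stand-ins for the small and large model forward passes over a tiny integer "vocabulary", and `K` and `MAX_TOKENS` are arbitrary. The comment in the verify loop marks the step that a GPU could batch but a CPU runs sequentially.

```c
/* Toy sketch of greedy speculative decoding (not the repo's actual code).
 * draft_next/target_next are stand-ins for the draft and target model
 * forward passes so the control flow compiles and runs on its own. */
#include <stdio.h>

#define K 4            /* number of draft tokens proposed per step */
#define MAX_TOKENS 32  /* total tokens to generate */

/* Toy "draft model": cheap but sometimes wrong next-token prediction. */
static int draft_next(int prev) { return (prev * 3 + 1) % 10; }

/* Toy "target model": treated as the ground-truth next-token prediction. */
static int target_next(int prev) {
    return (prev % 2 == 0) ? (prev * 3 + 1) % 10 : (prev + 7) % 10;
}

int main(void) {
    int token = 1;      /* start token */
    int generated = 0;

    while (generated < MAX_TOKENS) {
        /* 1. Draft phase: the small model proposes K tokens autoregressively. */
        int draft[K];
        int prev = token;
        for (int i = 0; i < K; i++) {
            draft[i] = draft_next(prev);
            prev = draft[i];
        }

        /* 2. Verify phase: the target model checks each draft token.
         * On a GPU the K verification passes can be batched into one call;
         * on a CPU they run one after another, so the draft model's cost
         * is pure overhead and decoding is slower than the baseline. */
        prev = token;
        for (int i = 0; i < K && generated < MAX_TOKENS; i++) {
            int expected = target_next(prev);
            if (draft[i] == expected) {
                printf("%d ", draft[i]);   /* draft token accepted */
                prev = draft[i];
                generated++;
            } else {
                printf("%d ", expected);   /* reject: emit the target's token and restart drafting */
                prev = expected;
                generated++;
                break;
            }
        }
        token = prev;
    }
    printf("\n");
    return 0;
}
```

Compiling with something like `gcc sketch.c -o sketch && ./sketch` prints the accepted/corrected token stream; when a draft token is rejected, the target's token is emitted instead and the next draft round starts from it.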

mscheong01 · Mar 10 '25 05:03

> I may work on adding GPU support in the future to improve performance.

Also, contributions are always welcome if anyone is interested. 😄

mscheong01 · Mar 10 '25 05:03