Slower than llama2.c?
I tried running the example (stories42M/stories15M) and compared throughput (tok/sec) against the original llama2.c; this variant runs slower.
Is that to be expected?
Yes, that's expected. This implementation currently runs on the CPU, so it doesn't leverage GPU acceleration. Speculative decoding relies on verifying the draft tokens in parallel, a step that benefits significantly from GPU parallelism, so without GPU support token generation is naturally slower. I may add GPU support in the future to improve performance. In the meantime, consider using this code as supplementary material to better understand the SD algorithm.
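For reference, here is a minimal, self-contained sketch of the draft-and-verify loop at the heart of speculative decoding. The `draft_logits`, `target_logits`, toy vocabulary, and `gamma` value are hypothetical stand-ins rather than the models or parameters used in this repo; the point is just to show where the batched, GPU-friendly verification step sits.

```python
import numpy as np

VOCAB = 32
rng = np.random.default_rng(0)

def draft_logits(ctx):
    # Toy stand-in for the small/fast draft model: a deterministic
    # pseudo-random distribution that depends on the context length.
    p = np.abs(np.sin(np.arange(VOCAB) + len(ctx)))
    return p / p.sum()

def target_logits(ctx):
    # Toy stand-in for the large/slow target model.
    p = np.abs(np.cos(np.arange(VOCAB) + len(ctx))) + 0.1
    return p / p.sum()

def speculative_step(ctx, gamma=4):
    """Draft `gamma` tokens autoregressively, then verify them against the
    target model. Here the target calls run one after another on CPU; in a
    GPU implementation these gamma forward passes are batched into a single
    call, which is where the speed-up comes from."""
    drafts, q_probs = [], []
    for _ in range(gamma):
        q = draft_logits(ctx + drafts)
        t = int(rng.choice(VOCAB, p=q))
        drafts.append(t)
        q_probs.append(q)

    accepted = []
    for i, t in enumerate(drafts):
        p = target_logits(ctx + accepted)            # target distribution at position i
        if rng.random() < min(1.0, p[t] / q_probs[i][t]):
            accepted.append(t)                        # draft token accepted
        else:
            # Rejected: resample from the residual distribution max(p - q, 0).
            residual = np.maximum(p - q_probs[i], 0)
            residual = residual / residual.sum() if residual.sum() > 0 else p
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break
    # (The extra "bonus" token sampled when all drafts are accepted is
    # omitted here for brevity.)
    return accepted

print(speculative_step(ctx=[1, 2, 3]))
```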
Also, contributions are always welcome if anyone is interested. 😄