Eric Buehler


@lucasavila00, does `llama.cpp` also get a T/s similar to our 549? It seems like dequantizing reduces performance severely, although perhaps it is better at larger batch sizes?

@lucasavila00, I wonder if it is the `volta` kernels that are slower than the `turing` ones? It seems like we spend ~62% of our time in the sgemm function, but llama.cpp spends...

If I am not mistaken, our completion performance should also improve by 60% (as prompt performance did) because of the new F16 dequant support?
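
To illustrate what the F16 dequant path means, here is a minimal sketch, assuming a hypothetical `Q8Block` layout (one f16 scale shared by 32 signed 8-bit weights); the real kernels run on the GPU, but the arithmetic is the same:

```rust
use half::f16;

/// Hypothetical Q8_0-style block: 32 signed 8-bit weights sharing one f16 scale.
struct Q8Block {
    scale: f16,
    qs: [i8; 32],
}

/// Dequantize straight to f16 instead of f32, so the following GEMM can stay
/// in half precision rather than falling back to an f32 sgemm.
fn dequantize_to_f16(block: &Q8Block) -> [f16; 32] {
    let scale = block.scale.to_f32();
    let mut out = [f16::ZERO; 32];
    for (o, &q) in out.iter_mut().zip(block.qs.iter()) {
        *o = f16::from_f32(scale * q as f32);
    }
    out
}
```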

Ah, ok. I'm interested in how our performance compares to llama.cpp in that situation.

Yes, I just need to finish the testing and then I'll merge #234. I am looking forward to Candle adding support for calling hgemm, but if that takes a while...
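
For reference, the call site would look roughly like this once an hgemm path exists; a minimal sketch using candle's public tensor API (casting both operands to F16 before `matmul`), not the actual kernel dispatch:

```rust
use candle_core::{DType, Result, Tensor};

/// Sketch only: cast both operands to F16 so the matmul can use a
/// half-precision GEMM (hgemm) instead of dequantizing to F32 and calling sgemm.
fn matmul_f16(a: &Tensor, b: &Tensor) -> Result<Tensor> {
    let a = a.to_dtype(DType::F16)?;
    let b = b.to_dtype(DType::F16)?;
    a.matmul(&b)
}
```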

@lucasavila00 yes, that is possible. Are they timing the memory transfer and sampling?
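
The distinction matters when comparing numbers: a T/s figure that times only the forward pass will look better than one that also counts the device-to-host logits copy and the sampling step. A minimal sketch of the latter, with stand-in closures for the model and sampler:

```rust
use std::time::Instant;

/// Measure decode throughput with the timer around the whole loop, so the
/// logits transfer and sampling are included in the T/s number.
fn decode_tokens_per_second(
    mut forward: impl FnMut(u32) -> Vec<f32>, // stand-in for the model forward pass
    sample: impl Fn(&[f32]) -> u32,           // stand-in for the sampler
    n_tokens: usize,
) -> f64 {
    let mut token = 0u32;
    let start = Instant::now();
    for _ in 0..n_tokens {
        let logits = forward(token); // forward pass (+ implicit device->host copy)
        token = sample(&logits);     // host-side sampling
    }
    n_tokens as f64 / start.elapsed().as_secs_f64()
}
```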

> It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: [vllm-project/vllm#2188](https://github.com/vllm-project/vllm/pull/2188)

Yes, this implementation only checks if the vocabs...
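
A minimal sketch of that vocab check, assuming the `tokenizers` crate and an illustrative function name (not the actual code in the PR):

```rust
use tokenizers::Tokenizer;

/// Speculative decoding only needs the draft and target tokenizers to map the
/// same strings to the same ids; they need not be the same tokenizer object.
fn vocabs_compatible(target: &Tokenizer, draft: &Tokenizer) -> bool {
    target.get_vocab(true) == draft.get_vocab(true)
}
```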

> I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM...

This PR adds the base framework for speculative decoding (SD). Further speed improvements will follow, in addition to self-speculative decoding.
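
For context, a minimal greedy-verification sketch of the speculative decoding loop, with hypothetical closures standing in for the draft and target models (the actual framework uses rejection sampling over probabilities rather than argmax matching):

```rust
/// One speculative step: draft `gamma` tokens cheaply, verify them with a
/// single target pass, and keep the longest agreeing prefix.
fn speculative_step(
    prefix: &mut Vec<u32>,
    gamma: usize,
    mut draft_next: impl FnMut(&[u32]) -> u32,         // draft model: next token id
    mut target_argmax: impl FnMut(&[u32]) -> Vec<u32>, // target model: its choice at each drafted position
) {
    // 1. Draft `gamma` tokens autoregressively with the cheap model.
    let mut ctx = prefix.clone();
    let mut drafted = Vec::with_capacity(gamma);
    for _ in 0..gamma {
        let t = draft_next(&ctx);
        drafted.push(t);
        ctx.push(t);
    }

    // 2. One target-model pass over prefix + drafted tokens; `verify[i]` is the
    //    target's token for the position of drafted[i].
    let verify = target_argmax(&ctx);

    // 3. Accept drafted tokens while the target agrees; on the first
    //    disagreement, take the target's token and stop.
    for (i, &d) in drafted.iter().enumerate() {
        if verify.get(i) == Some(&d) {
            prefix.push(d);
        } else {
            if let Some(&t) = verify.get(i) {
                prefix.push(t);
            }
            break;
        }
    }
}
```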