Daniel Han

Results 781 comments of Daniel Han

@jeromeku For Mistral itself: https://unsloth.ai/blog/mistral-benchmark ![image](https://github.com/unslothai/unsloth/assets/23090290/c7484143-01dd-457c-bedb-f932ae1cd8a3) Gemma's VRAM reduction should be similar to our breakdown for Mistral. For inference for Gemma - I did make it 2x faster, but it's...

@jeromeku That'll be cool!! :) We can collab either via GitHub or async on our Discord - whatever suits you :)

@jeromeku Oh ye a roadmap would be nice - don't actually have one for inference specifically :)

@jeromeku In terms of inference specifically: 1. GPT Fast 2. Speculative Decoding (use a small model to generate tokens, then use a large model in 1 forward pass and see...
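The speculative decoding idea in point 2 can be sketched roughly as below - a toy greedy version with hypothetical stand-in models, not Unsloth's implementation. In a real setup the draft model would be a small LLM and the target a large one, with the verification of all k drafted tokens batched into a single forward pass:

```python
# Toy stand-ins for the real models (hypothetical, for illustration only):
# a real setup would use e.g. a small 1B draft model and a large target model.
def draft_model(prefix):
    """Cheap model: proposes one next token."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def target_model(prefix):
    """Expensive model: the token we actually trust.

    (Identical to the draft here so the demo always accepts; in reality
    the two models only mostly agree, which is the whole trick.)
    """
    return (sum(prefix) * 31 + len(prefix)) % 100

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch.

    The draft model proposes k tokens autoregressively; the target model
    then checks all k positions (in practice in ONE batched forward pass)
    and we keep the longest verified prefix, plus one corrected token on
    the first mismatch.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_tokens:
        # 1) draft k tokens cheaply with the small model
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) verify with the target model (conceptually one forward pass)
        ctx = list(tokens)
        for t in draft:
            expected = target_model(ctx)
            if expected == t:
                ctx.append(t)          # draft token verified, keep it
            else:
                ctx.append(expected)   # take the target's token instead
                break
        tokens = ctx
    return tokens[:len(prompt) + n_tokens]
```

Greedy speculative decoding is guaranteed to produce the same tokens as decoding with the target model alone; the speedup comes from the target model only running once per k drafted tokens instead of once per token.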

* Oh ye KV cache quant is cool! One issue I have with it is that dynamically quantizing the KV cache will cause overhead issues - a super fast method for...

Oh so GEMV is generally OK I guess - the issue is the dequant step merged in (ie what you were doing with GPTQ, except it's not matrix-matrix mult...

Oh for inference, your method of fusing the dequant step inside the kernel is actually ideal! For training it's not, since CUBLAS is relatively smart in data movements. An ideal...
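What "fusing the dequant step inside the kernel" means for a GEMV can be sketched in plain Python for clarity (a real kernel would be CUDA/Triton). The quantization format here is an assumption for illustration - int8 weights with one scale per row - GPTQ's actual format is more involved:

```python
def gemv_unfused(q_w, scales, x):
    """Two passes: materialize the dequantized matrix, then multiply.

    The intermediate `w` costs extra memory traffic, which is exactly
    what the fused version avoids.
    """
    w = [[q * s for q in row] for row, s in zip(q_w, scales)]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def gemv_fused(q_w, scales, x):
    """One pass: dequantize each weight on the fly inside the dot product."""
    return [sum((q * s) * xj for q, xj in zip(row, x))
            for row, s in zip(q_w, scales)]

# Hypothetical toy data: 2x3 int8 weight matrix, one fp32 scale per row.
q_w = [[127, -64, 3], [10, 20, -30]]
scales = [0.01, 0.5]
x = [1.0, 2.0, 3.0]
```

Both variants compute the same y; the fused one simply never writes the dequantized matrix back to memory, which is why it wins for inference GEMVs.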

Another approach people use is row-wise quantization ![image](https://github.com/unslothai/unsloth/assets/23090290/d2304cb9-9f22-4fef-8063-234a86bd8371) which again can be done in parallel with a reduction as I described above
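A minimal sketch of that row-wise scheme: each row gets its own scale from a reduction over just that row, and since rows are independent, the reductions run in parallel. Absmax int8 scaling is an assumption here; other row-wise schemes exist:

```python
def quantize_rowwise(mat):
    """Quantize each row of `mat` to int8 with its own absmax scale."""
    q, scales = [], []
    for row in mat:
        amax = max(abs(x) for x in row) or 1.0  # the per-row reduction
        scale = amax / 127.0
        q.append([round(x / scale) for x in row])
        scales.append(scale)
    return q, scales

def dequantize_rowwise(q, scales):
    """Recover an approximation of the original matrix."""
    return [[x * s for x in row] for row, s in zip(q, scales)]
```

Because each row carries its own scale, an outlier in one row no longer blows up the quantization error of every other row, at the cost of storing one extra scalar per row.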

@jeromeku Extremely sorry on the delay - yep sounds right! :) @nivibilla Yep!

@jeromeku Yes that can be one of the main issues - the other is folding it inside other kernels, ie a single kernel can become too complex to do...