Daniel Han

Results 781 comments of Daniel Han

@jeromeku For Mistral itself: https://unsloth.ai/blog/mistral-benchmark ![image](https://github.com/unslothai/unsloth/assets/23090290/c7484143-01dd-457c-bedb-f932ae1cd8a3) Gemma's VRAM reduction should be similar to our breakdown for Mistral. For inference for Gemma - I did make it 2x faster, but it's...

@jeromeku That'll be cool!! :) We can collab either via GitHub or async on our Discord - whatever suits you :)

@jeromeku Oh ye a roadmap would be nice - don't actually have one for inference specifically :)

@jeromeku In terms of inference specifically: 1. GPT Fast 2. Speculative Decoding (use a small model to generate tokens, then use a large model in 1 forward pass and see...
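The speculative decoding idea in point 2 can be sketched roughly as below - a toy greedy version with hypothetical stand-in models, not Unsloth's implementation. In a real setup the draft model would be a small LLM and the target a large one, with the verification of all k drafted tokens batched into a single forward pass:

```python
# Toy stand-ins for the real models (hypothetical, for illustration only):
# a real setup would use e.g. a small 1B draft model and a large target model.
def draft_model(prefix):
    """Cheap model: proposes one next token."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def target_model(prefix):
    """Expensive model: the token we actually trust.

    (Identical to the draft here so the demo always accepts; in reality
    the two models only mostly agree, which is the whole trick.)
    """
    return (sum(prefix) * 31 + len(prefix)) % 100

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch.

    The draft model proposes k tokens autoregressively; the target model
    then checks all k positions (in practice in ONE batched forward pass)
    and we keep the longest verified prefix, plus one corrected token on
    the first mismatch.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_tokens:
        # 1) draft k tokens cheaply with the small model
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) verify with the target model (conceptually one forward pass)
        ctx = list(tokens)
        for t in draft:
            expected = target_model(ctx)
            if expected == t:
                ctx.append(t)          # draft token verified, keep it
            else:
                ctx.append(expected)   # take the target's token instead
                break
        tokens = ctx
    return tokens[:len(prompt) + n_tokens]
```

Greedy speculative decoding is guaranteed to produce the same tokens as decoding with the target model alone; the speedup comes from the target model only running once per k drafted tokens instead of once per token.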

* Oh ye KV cache quant is cool! One issue I have with it is that dynamically quantizing the KV cache will cause overhead issues - a super fast method for...

Oh so GEMV is generally OK I guess - the issue is the dequant step merged in (ie what you were doing with GPTQ, except it's not matrix-matrix mult...

Oh for inference, your method of fusing the dequant step inside the kernel is actually ideal! For training it's not, since CUBLAS is relatively smart in data movements. An ideal...
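What "fusing the dequant step inside the kernel" means for a GEMV can be sketched in plain Python for clarity (a real kernel would be CUDA/Triton). The quantization format here is an assumption for illustration - int8 weights with one scale per row - GPTQ's actual format is more involved:

```python
def gemv_unfused(q_w, scales, x):
    """Two passes: materialize the dequantized matrix, then multiply.

    The intermediate `w` costs extra memory traffic, which is exactly
    what the fused version avoids.
    """
    w = [[q * s for q in row] for row, s in zip(q_w, scales)]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def gemv_fused(q_w, scales, x):
    """One pass: dequantize each weight on the fly inside the dot product."""
    return [sum((q * s) * xj for q, xj in zip(row, x))
            for row, s in zip(q_w, scales)]

# Hypothetical toy data: 2x3 int8 weight matrix, one fp32 scale per row.
q_w = [[127, -64, 3], [10, 20, -30]]
scales = [0.01, 0.5]
x = [1.0, 2.0, 3.0]
```

Both variants compute the same y; the fused one simply never writes the dequantized matrix back to memory, which is why it wins for inference GEMVs.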

Another approach people use is row-wise quantization ![image](https://github.com/unslothai/unsloth/assets/23090290/d2304cb9-9f22-4fef-8063-234a86bd8371) which again can be done in parallel with a reduction as I described above
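A minimal sketch of that row-wise scheme: each row gets its own scale from a reduction over just that row, and since rows are independent, the reductions run in parallel. Absmax int8 scaling is an assumption here; other row-wise schemes exist:

```python
def quantize_rowwise(mat):
    """Quantize each row of `mat` to int8 with its own absmax scale."""
    q, scales = [], []
    for row in mat:
        amax = max(abs(x) for x in row) or 1.0  # the per-row reduction
        scale = amax / 127.0
        q.append([round(x / scale) for x in row])
        scales.append(scale)
    return q, scales

def dequantize_rowwise(q, scales):
    """Recover an approximation of the original matrix."""
    return [[x * s for x in row] for row, s in zip(q, scales)]
```

Because each row carries its own scale, an outlier in one row no longer blows up the quantization error of every other row, at the cost of storing one extra scalar per row.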

@jeromeku Extremely sorry on the delay - yep sounds right! :) @nivibilla Yep!

@jeromeku Yes that can be one of the main issues - the other is folding it inside other kernels, ie a single kernel can become too complex to do...