
Tinygrad Quantization Support [WIP]

Open · KhanerX opened this issue 11 months ago · 3 comments

What I did:

  1. Defined custom layers for affine-quantized models, with integer weights plus float16 scales and biases (the zero-point correction).
  2. Loaded an MLX-Community quantized model and unpacked the weights.
  3. Wrote the forward logic for the quantized layers, following this paper (see section 2.3); a rough sketch of the idea follows this list.
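
For anyone curious about the mechanics, here is a minimal numpy sketch of the dequantize-then-matmul forward pass described above. The packing layout (32 // bits values per uint32, one scale and bias per group of 64 weights) is my assumption about the MLX-community format, and the function names are made up for illustration, not exo's actual code:

```python
import numpy as np

def dequantize_affine(w_packed, scales, biases, bits=8, group_size=64):
    """Unpack an MLX-style affine-quantized weight and reconstruct floats.

    Illustrative only: assumes `w_packed` is uint32 with 32 // bits values
    packed per element, and `scales` / `biases` hold one entry per
    `group_size` consecutive weights. Reconstruction is w ≈ scale * q + bias.
    """
    vals_per_u32 = 32 // bits
    shifts = np.arange(vals_per_u32, dtype=np.uint32) * bits
    mask = np.uint32((1 << bits) - 1)

    # Unpack the low-bit fields out of each uint32 (assumed little-end first).
    q = (w_packed[..., None] >> shifts) & mask                   # (out, in/vals, vals)
    q = q.reshape(*w_packed.shape[:-1], -1).astype(np.float32)   # (out, in)

    # Apply the per-group scale and bias (the zero-point correction).
    q = q.reshape(*q.shape[:-1], -1, group_size)                 # (out, groups, group_size)
    w = scales[..., None].astype(np.float32) * q + biases[..., None].astype(np.float32)
    return w.reshape(*w.shape[:-2], -1)                          # (out, in)

def quantized_linear(x, w_packed, scales, biases, bits=8, group_size=64):
    """Straightforward dequantize-then-matmul forward pass (float math)."""
    w = dequantize_affine(w_packed, scales, biases, bits, group_size)
    return x @ w.T
```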

Todo:

  • [ ] write tests; test with multiple nodes and different Llama models
  • [x] support 4-bit quantization
  • [ ] do the forward math in integer (see section 2.2 of the mentioned paper; sketched below after this list)
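
For the last item, the integer-math idea from section 2.2 boils down to: quantize the activations affinely, accumulate the matmul in int32, and fold the float scales back in once at the end. A hedged sketch (illustrative names, per-tensor activation quantization assumed, not exo's actual layers):

```python
import numpy as np

def int_matmul_forward(x, w_q, w_scale, w_zero, bits=8):
    """Integer-arithmetic forward sketch (after Jacob et al., section 2.2).

    Quantizes activations on the fly (per-tensor, affine), accumulates the
    matmul in int32, then applies the combined scale once at the end.
    """
    qmax = (1 << bits) - 1

    # Affine-quantize the activations: x ≈ x_scale * (q_x - x_zero).
    x_scale = max(float(x.max() - x.min()) / qmax, 1e-8)
    x_zero = int(round(-float(x.min()) / x_scale))
    q_x = np.clip(np.round(x / x_scale + x_zero), 0, qmax).astype(np.int32)

    # Integer accumulation: sum_j (q_x - x_zero) * (q_w - w_zero), all in int32.
    acc = (q_x - x_zero) @ (w_q.astype(np.int32) - int(w_zero)).T

    # Fold the float scales back in exactly once.
    return (x_scale * w_scale) * acc
```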

With this first commit, you can run `exo --run-model="llama-3.2-1b-8bit"` with the tinygrad backend and an "mlx-community" model.

KhanerX · Jan 25 '25, 15:01

Also, I'm doing the math in float32 right now, which adds overhead. When I change it to float16, I think something overflows and the model outputs nothing. I will fix this.
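
To make the overflow concrete, a tiny standalone illustration (not exo code): float16 saturates at 65504, so a dot product over a hidden dimension of a few thousand values overflows to inf when accumulated directly in float16, and the infs turn into NaNs downstream, which would explain the empty output:

```python
import numpy as np

# float16 tops out at 65504, so a long dot product accumulated directly
# in float16 overflows to inf; downstream softmax/norms then produce NaN.
x = np.full(4096, 8.0, dtype=np.float16)
w = np.full(4096, 8.0, dtype=np.float16)

acc = np.float16(0.0)
for a, b in zip(x, w):
    acc += a * b                 # reaches inf around the 1024th term

print(acc)                                            # inf
print(x.astype(np.float32) @ w.astype(np.float32))    # 262144.0 (correct)
```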

KhanerX · Jan 25 '25, 15:01

This is a great start - I tested this and it works. That's awesome, because it means we can support any MLX model in tinygrad.

Are you sending parameters to the GPU in float32, or are they being sent in fp8? Just wondering what kind of speed to expect here, and how close this gets to MLX's quantized performance.

AlexCheema · Jan 26 '25, 20:01

Tested it out on an M3 Pro with mlx-community/Llama-3.2-1B-Instruct-8bit:

Benchmark screenshots attached (Feb 5, 2025): old PR, new PR, and MLX baseline.

Just 2.5x slower than MLX now 🚀🚀

@KhanerX @AlexCheema

varshith15 · Feb 05 '25, 10:02