Support quantization for non-multiples of 32.
The yayi2 30b k/v layers have [input_dims=7168, out_dims=112], so quantization fails with the error "all dimensions should be divisible by 32 for now". FYI, here is the implementation of the yayi2 30b model's k,v layer -> https://huggingface.co/wenge-research/yayi2-30b/blob/main/modeling_yayi.py#L180
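For reference, here is a minimal sketch of where it fails (shapes taken from the yayi2 config; the group_size/bits values are just illustrative defaults):

```python
import mlx.core as mx
import mlx.nn as nn

# yayi2-30b k/v projection: input_dims=7168, output_dims=112
kv_proj = nn.Linear(7168, 112, bias=False)

# Raises "all dimensions should be divisible by 32 for now"
# because the output dimension (112) is not a multiple of 32.
q_proj = nn.QuantizedLinear.from_linear(kv_proj, group_size=64, bits=4)
```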
Yes, that's a known limitation of our quantization at the moment. Will mark this as an enhancement.
In the meantime, one thing you could do as a workaround so you aren't blocked by this is to concatenate the K and V projections and do them as a single matmul (112 * 2 / 32 = 7).
Hi Awni, thanks for the reply. Would you be able to give a bit more detail about the workaround? By the way, happy new year! :)
Yes so there are two projections, one for the keys and one for the values.
k = x @ Wk.T
v = x @ Wv.T
Instead you could do something like:
k, v = mx.split(x @ mx.concatenate([Wk, Wv], axis=0).T, 2, axis=-1)
Now you can precompute the concatenated matrix and quantize it, since its output dimension (112 * 2 = 224) is divisible by 32.
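Putting it together, a rough sketch of the fused projection (not the exact yayi2 code; `Wk`/`Wv` stand in for the existing key/value weights and the quantization parameters are just the defaults):

```python
import mlx.core as mx

# The two original projection weights, each [out_dims=112, input_dims=7168]
Wk = mx.random.normal((112, 7168))
Wv = mx.random.normal((112, 7168))

# Fuse them once ahead of time: [224, 7168], and 224 is divisible by 32
Wkv = mx.concatenate([Wk, Wv], axis=0)
Wq, scales, biases = mx.quantize(Wkv, group_size=64, bits=4)

def kv_proj(x):
    # One quantized matmul for both projections, then split the last axis
    out = mx.quantized_matmul(
        x, Wq, scales, biases, transpose=True, group_size=64, bits=4
    )
    return mx.split(out, 2, axis=-1)

k, v = kv_proj(mx.random.normal((1, 16, 7168)))
```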
Thanks @awni, this workaround works. I have added it to the yayi2 example. :)
Nice.
I think (at least for the non-group axis) fixing this is mostly a matter of bounds checking in the matmul, but @angeloskath can say more about if / when we could support more flexible quantization.
Also https://github.com/ml-explore/mlx-examples/issues/279
@mzbac not sure you saw, in 0.0.10 this should be fixed. We can add yayi2 to mlx-lm now 😃
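A quick sanity check of the new behavior (same shape that failed before; quantization parameters are illustrative):

```python
import mlx.core as mx

# The previously failing shape: out_dims=112 is not a multiple of 32
w = mx.random.normal((112, 7168))

# With mlx >= 0.0.10 this should go through without the divisible-by-32 error
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
print(w_q.shape, scales.shape, biases.shape)
```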
Nice, I will take a look. Hopefully, it just needs to be mapped to the llama arch and it will work.
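If the architecture really does line up, one low-effort way to try it (just a sketch; the checkpoint path is hypothetical) is to point the converted config's model_type at the existing llama implementation:

```python
import json
from pathlib import Path

# Hypothetical path to a locally converted yayi2 checkpoint
config_path = Path("yayi2-30b-mlx/config.json")

config = json.loads(config_path.read_text())
config["model_type"] = "llama"  # reuse the llama module in mlx-lm
config_path.write_text(json.dumps(config, indent=2))
```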
Oh that would be amazing.
This can be closed now: MLX supports quantization for dims that aren't multiples of 32, and I have tested it on the yayi2 model; it works as expected.