
[FEATURE REQUEST] natively parse 4-bit embedding quantized tensors from GGUF Q4_0 files

Open · metascroy opened this issue 9 months ago · 0 comments

In https://github.com/pytorch/torchchat/blob/main/build/gguf_loader.py, we convert Q4_0-quantized linear weights directly into the packed format produced by _convert_weight_to_int4pack (PyTorch's native 4-bit quantization). All other tensors are converted to float.
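
For context, a Q4_0 tensor is laid out as 32-element blocks, each holding a little-endian fp16 scale `d` followed by 16 bytes of packed nibbles (the low nibble of byte `i` is element `i`, the high nibble is element `i + 16`), with values recovered as `d * (q - 8)`. A minimal sketch of dequantizing such a buffer in PyTorch; the helper name and the flat-uint8-buffer input are assumptions for illustration, not the actual gguf_loader.py code:

```python
import torch

Q4_0_BYTES = 2 + 16  # fp16 scale + 16 packed nibble bytes per 32-element block

def dequantize_q4_0(raw: torch.Tensor, n_elements: int) -> torch.Tensor:
    # Hypothetical helper: `raw` is a flat uint8 buffer of consecutive Q4_0 blocks.
    blocks = raw.reshape(-1, Q4_0_BYTES)
    # Bytes 0-1 of each block: the fp16 scale d.
    d = blocks[:, :2].contiguous().view(torch.float16).to(torch.float32)
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).to(torch.int8) - 8   # elements 0..15 of each block
    hi = (qs >> 4).to(torch.int8) - 8     # elements 16..31 of each block
    q = torch.cat([lo, hi], dim=1).to(torch.float32)  # (n_blocks, 32)
    return (q * d).reshape(-1)[:n_elements]
```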

We should be able to convert Q4_0 embedding tensors directly to our 4-bit embedding quantization as well, rather than converting them to float.
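
Because Q4_0 already stores signed 4-bit codes with one fp16 scale per 32-element group, the blocks should map one-to-one onto a group-wise int4 embedding table with no float round-trip. A rough sketch under that assumption; the output layout (int8 codes in [-8, 7] plus per-group scales, group size 32) and the function name are hypothetical, since the exact format torchchat's quantized embedding expects may differ:

```python
import torch

def q4_0_to_int4_embedding(raw: torch.Tensor, vocab_size: int, dim: int):
    # `raw`: flat uint8 buffer of Q4_0 blocks for a (vocab_size, dim) table.
    blocks = raw.reshape(-1, 18)                 # 2-byte fp16 scale + 16 nibble bytes
    scales = blocks[:, :2].contiguous().view(torch.float16)
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).to(torch.int8) - 8          # elements 0..15 of each block
    hi = (qs >> 4).to(torch.int8) - 8            # elements 16..31 of each block
    codes = torch.cat([lo, hi], dim=1)           # (n_blocks, 32), values in [-8, 7]
    # Hypothetical target layout: one int8 code per element, one scale per
    # 32-element group; the real torchchat embedding format may pack differently.
    weight_int4 = codes.reshape(vocab_size, dim)
    group_scales = scales.reshape(vocab_size, dim // 32)
    return weight_int4, group_scales
```

Reusing the stored codes and scales this way would skip the quantize-dequantize-requantize round-trip, keeping the embedding weights bit-exact with the GGUF file.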

metascroy · Apr 30 '24 01:04