[FEATURE REQUEST] natively parse 4-bit embedding quantized tensors from GGUF Q4_0 files
In https://github.com/pytorch/torchchat/blob/main/build/gguf_loader.py, we convert Q4_0 quantized linear weights directly into the packed format consumed by _convert_weight_to_int4pack (PyTorch's native 4-bit quantization). All other tensors are converted to float.
We should be able to convert the Q4_0 embedding tensors directly to our 4-bit embedding quantization as well, rather than converting them to float.
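
For reference, here is a minimal sketch of the unpacking step this would need. It assumes the standard GGUF Q4_0 block layout (per block: one float16 scale followed by 16 bytes packing 32 4-bit values, dequantized as `(q - 8) * scale`); the helper name `unpack_q4_0` is hypothetical, not an existing function in gguf_loader.py:

```python
import torch

Q4_0_BLOCK = 32          # values per Q4_0 block
Q4_0_BLOCK_BYTES = 18    # 2-byte fp16 scale + 16 bytes of packed nibbles

def unpack_q4_0(raw: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: split raw Q4_0 bytes into int4 codes and scales.

    `raw` is uint8 with shape (n_blocks, 18). Returns centered int4 codes
    in [-8, 7] with shape (n_blocks, 32) and fp16 scales (n_blocks, 1).
    """
    scales = raw[:, :2].contiguous().view(torch.float16)  # reinterpret 2 bytes as fp16
    packed = raw[:, 2:]                                   # (n_blocks, 16) packed nibbles
    low = packed & 0x0F                                   # values 0..15 of each block
    high = packed >> 4                                    # values 16..31 of each block
    codes = torch.cat([low, high], dim=1).to(torch.int8) - 8
    return codes, scales

# Shape sanity check against the current float fallback:
raw = torch.randint(0, 256, (4, Q4_0_BLOCK_BYTES), dtype=torch.uint8)
codes, scales = unpack_q4_0(raw)
dequant = codes.to(torch.float16) * scales  # what the float conversion produces today
```

Repacking these int4 codes and per-block scales into the layout expected by our 4-bit embedding quantization would replace the current dequantize-then-requantize round trip; the exact repacking step depends on that layout, so it is left out of the sketch.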