The `embedding_4bit` implementation has a strong assumption about weight_quant_min and weight_quant_max
🐛 Describe the bug
In the `embedding_4bit` implementation here, the code assumes the quantized data has `quant_min=-8`, so it uses `weight = weight.view(torch.int8).add(-8)` to shift the data. However, the shift should be based on `weight_quant_min`, i.e. it should be `weight = weight.view(torch.int8).add(weight_quant_min)`.
Fixing this also requires that the op construction pass the correct min and max. In the llama example here, it passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The whole pipeline currently works only because the weight is quantized to `[-8, 7]` and `embedding_4bit` assumes `quant_min=-8`. It will fail if the weight is actually quantized to `[0, 15]`.
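To illustrate the failure mode, here is a simplified plain-Python sketch (not the actual executorch kernel; `quantize`, `dequantize`, and `unpack_with_shift` are hypothetical helpers) showing why the unpack shift must match the actual `weight_quant_min` rather than a hard-coded `-8`:

```python
# Simplified sketch of 4-bit affine quantization (not the real kernel).

def quantize(x, scale, quant_min, quant_max):
    """Quantize a float to an integer code clamped to [quant_min, quant_max]."""
    q = round(x / scale)
    return max(quant_min, min(quant_max, q))

def dequantize(q, scale):
    return q * scale

def unpack_with_shift(stored, shift):
    """Recover the quantized code from the stored unsigned 4-bit nibble.
    The shift must undo exactly the offset applied when packing."""
    return stored + shift

scale = 0.5
x = -2.0

# Weight quantized with range [-8, 7]: code -4 is stored as -4 - (-8) = 4.
q_signed = quantize(x, scale, -8, 7)           # -4
stored = q_signed - (-8)                       # 4, fits in an unsigned nibble
assert dequantize(unpack_with_shift(stored, -8), scale) == -2.0  # correct

# Weight quantized with range [0, 15]: codes are already non-negative.
q_unsigned = quantize(x, scale, 0, 15)         # clamped to 0
stored = q_unsigned - 0                        # 0
# A hard-coded -8 shift misreads the code as -8 ...
assert unpack_with_shift(stored, -8) == -8     # wrong code
# ... while shifting by the actual quant_min (0) recovers the stored code.
assert unpack_with_shift(stored, 0) == 0       # correct code
```

With the `[0, 15]` range, the offset correction normally handled by the zero point is silently corrupted by the hard-coded `-8`, which is why the kernel only appears to work for `[-8, 7]`-quantized weights.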
Versions
Main branch