The `embedding_4bit` implementation makes a strong assumption about `weight_quant_min` and `weight_quant_max`
🐛 Describe the bug
In the `embedding_4bit` implementation here, it assumes the quantized data has `quant_min=-8`, so it uses `weight = weight.view(torch.int8).add(-8)` to shift the data.
However, the shift should be based on `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.
Fixing this also requires that the op construction pass the correct min and max. The llama example here passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The whole pipeline currently works only because the weights are quantized to [-8, 7] and `embedding_4bit` assumes `quant_min=-8`. It will fail if the weights are actually quantized to [0, 15].
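For illustration, here is a minimal sketch of the generalized shift. The helper name `unpack_and_shift_4bit` and the nibble ordering are assumptions for illustration, not the actual executorch kernel:

```python
import torch

def unpack_and_shift_4bit(packed: torch.Tensor, weight_quant_min: int) -> torch.Tensor:
    """Unpack two 4-bit values per byte, then shift by the actual quant_min."""
    packed = packed.view(torch.uint8)
    high = packed >> 4    # upper nibble, in [0, 15]
    low = packed & 0x0F   # lower nibble, in [0, 15]
    # Nibble ordering here (high first) is an assumption for illustration.
    unpacked = torch.stack([high, low], dim=-1).flatten(start_dim=-2)
    # Generalized shift: add weight_quant_min instead of the hard-coded -8.
    # For weights quantized to [-8, 7] this reproduces .add(-8);
    # for weights quantized to [0, 15] it correctly adds 0.
    return unpacked.to(torch.int8).add(weight_quant_min)

# Quick check with a single packed byte 0x1F -> nibbles (1, 15).
packed = torch.tensor([[0x1F]], dtype=torch.uint8)
print(unpack_and_shift_4bit(packed, -8))  # tensor([[-7,  7]]) for a [-8, 7] range
print(unpack_and_shift_4bit(packed, 0))   # tensor([[ 1, 15]]) for a [0, 15] range
```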
Versions
Main branch
@jerryzh168 @kimishpatel
@metascroy
I think @manuelcandales added the 4-bit embedding op, but I agree we should clean these ops up. Let me create an EE/BE task for this.