
The `embedding_4bit` implementation has a strong assumption about weight_quant_min and weight_quant_max


🐛 Describe the bug

The `embedding_4bit` implementation here assumes the quantized data has `quant_min=-8`, so it shifts the data with `weight = weight.view(torch.int8).add(-8)`.

However, the shift should be based on `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.

Fixing this also requires that the op construction pass the correct min and max. In the llama example here, it passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The whole pipeline currently works only because the weight happens to be quantized to [-8, 7] and `embedding_4bit` assumes `quant_min=-8`; it will fail if the weight is actually quantized to [0, 15].
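To illustrate the arithmetic (a minimal plain-Python sketch, not the actual ExecuTorch kernel; `unpack_nibble` is a hypothetical helper): 4-bit values are stored as unsigned nibbles in [0, 15], and the kernel recovers signed quantized values by adding the quant min. Hardcoding that shift to -8 only works for a [-8, 7] quantization range.

```python
def unpack_nibble(nibble: int, weight_quant_min: int) -> int:
    # Generic shift: maps a stored unsigned nibble in [0, 15]
    # to a quantized value in [quant_min, quant_min + 15].
    return nibble + weight_quant_min

# With quant_min=-8 (the range embedding_4bit assumes), the hardcoded
# add(-8) and the generic shift agree:
assert unpack_nibble(0, -8) == -8
assert unpack_nibble(15, -8) == 7

# With a [0, 15] range (quant_min=0), the generic shift is the identity,
# while a hardcoded -8 would corrupt every value:
assert unpack_nibble(15, 0) == 15
assert unpack_nibble(15, 0) != 15 + (-8)
```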

Versions

Main branch

junpeiz avatar Sep 03 '24 19:09 junpeiz

@jerryzh168 @kimishpatel

JacobSzwejbka avatar Sep 05 '24 16:09 JacobSzwejbka

@metascroy

kimishpatel avatar Oct 01 '24 20:10 kimishpatel

I think @manuelcandales added the 4-bit embedding op, but I agree we should clean them up. Let me create an EE/BE task for this.

metascroy avatar Nov 26 '24 17:11 metascroy