The `embedding_4bit` implementation makes a strong assumption about `weight_quant_min` and `weight_quant_max`
🐛 Describe the bug
In the `embedding_4bit` implementation here, it assumes the quantized data has `quant_min=-8`, so it uses `weight = weight.view(torch.int8).add(-8)` to shift the data.
However, the shift should be based on `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.
Fixing this also requires that the op construction pass the correct min and max. The llama example here passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The whole pipeline currently works only because the weights are quantized to [-8, 7] and `embedding_4bit` assumes `quant_min=-8`. It will fail if the weights are actually quantized to [0, 15].
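For illustration, here is a minimal sketch of the generalized shift. The helper name `unpack_and_shift_4bit` and the nibble ordering are assumptions for illustration, not the actual executorch kernel:

```python
import torch

def unpack_and_shift_4bit(packed: torch.Tensor, weight_quant_min: int) -> torch.Tensor:
    """Unpack two 4-bit values per byte, then shift by the actual quant_min."""
    packed = packed.view(torch.uint8)
    high = packed >> 4    # upper nibble, in [0, 15]
    low = packed & 0x0F   # lower nibble, in [0, 15]
    # Nibble ordering here (high first) is an assumption for illustration.
    unpacked = torch.stack([high, low], dim=-1).flatten(start_dim=-2)
    # Generalized shift: add weight_quant_min instead of the hard-coded -8.
    # For weights quantized to [-8, 7] this reproduces .add(-8);
    # for weights quantized to [0, 15] it correctly adds 0.
    return unpacked.to(torch.int8).add(weight_quant_min)

# Quick check with a single packed byte 0x1F -> nibbles (1, 15).
packed = torch.tensor([[0x1F]], dtype=torch.uint8)
print(unpack_and_shift_4bit(packed, -8))  # tensor([[-7,  7]]) for a [-8, 7] range
print(unpack_and_shift_4bit(packed, 0))   # tensor([[ 1, 15]]) for a [0, 15] range
```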
Versions
Main branch
@jerryzh168 @kimishpatel
@metascroy
I think @manuelcandales added the 4-bit embedding op, but I agree we should clean these ops up. Let me create an EE/BE task for this.