OPT model quantize_lm_head clarification
While testing OPT with quant_lm_head=True, here are the resulting weight keys after quantization:
weight keys: ['lm_head.g_idx', 'lm_head.qweight', 'lm_head.qzeros', 'lm_head.scales', 'model.decoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', ...
model.decoder.embed_tokens.weight is not quantized, but lm_head is. Unfortunately, vLLM's model code (and maybe HF transformers as well) ignores this lm_head layer during weight loading. I confirmed this for vLLM but am not 100% sure about transformers.
But OPT's lm_head is actually the same as (soft-linked to) model.decoder.embed_tokens in vLLM's code, and this appears to be true in transformers as well. I checked the original weights: lm_head exists in the checkpoint, but its size and values are exactly the same as embed_tokens, so the model authors appear to think lm_head should be ignored on load.
https://github.com/huggingface/transformers/blob/0ae789e04330e15a90e34cd723c851a8ab8d7ec5/src/transformers/models/opt/modeling_opt.py#L1001
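To illustrate, here is a minimal check (a sketch, assuming a standard transformers/torch install) showing that transformers ties OPT's lm_head to the input embeddings, which would make the checkpoint's lm_head.weight redundant:

```python
# Minimal sketch: verify that HF transformers ties OPT's lm_head to the
# input embeddings (weight tying), so lm_head.weight in the checkpoint
# carries no extra information.
import torch
from transformers import OPTForCausalLM

model = OPTForCausalLM.from_pretrained("facebook/opt-125m")

lm_head = model.lm_head.weight
embed_tokens = model.model.decoder.embed_tokens.weight

# Same underlying storage => one shared tensor (weight tying).
print(lm_head.data_ptr() == embed_tokens.data_ptr())  # expected: True
print(torch.equal(lm_head, embed_tokens))             # expected: True
```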
In vLLM's model loading code for OPT, the lm_head weights are skipped and soft-linked to the embeddings. This appears to be the same for HF transformers as well.
https://github.com/vllm-project/vllm/blob/26f2fb51133c85ad8a57a87c8037f750dda757f4/vllm/model_executor/models/opt.py#L288
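The linked loader boils down to something like this (a simplified paraphrase, not the exact vLLM code; the function and attribute names here are illustrative):

```python
# Simplified paraphrase of the skip-and-tie pattern: the checkpoint's
# lm_head.weight is dropped, and the embedding tensor is reused for the
# output projection instead.
def load_weights(model, weights):
    params = dict(model.named_parameters())
    for name, loaded_weight in weights:
        if "lm_head.weight" in name:
            # Skipped on purpose: lm_head is assumed to duplicate
            # embed_tokens, so the checkpoint tensor is never loaded.
            continue
        params[name].data.copy_(loaded_weight)
    # The "soft link": output logits are computed against embed_tokens.
    model.lm_head_weight = model.model.decoder.embed_tokens.weight
```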
So my naive question is: who is correct? AutoRound correctly finds and quantizes the lm_head layer, but this layer is actually ignored by model loaders? ={
This is in relation to the testing I am doing for the vLLM PR: https://github.com/vllm-project/vllm/pull/4442#issuecomment-2085491133
This becomes an issue when loading the quantized model, as vLLM completely skips the lm_head layers (pre- or post-quant). I assume the model code author saw no reason to load the same equivalent weights twice when the tensor sizes and values are exactly the same. A small illustration of the consequence follows.
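Concretely, if the loader drops everything under lm_head, all of AutoRound's quantized lm_head tensors from the key dump above are silently discarded (illustration only):

```python
# Illustration: a blanket skip on "lm_head" drops every quantized
# lm_head tensor that AutoRound produced (keys from the dump above).
quant_keys = [
    "lm_head.g_idx", "lm_head.qweight", "lm_head.qzeros", "lm_head.scales",
    "model.decoder.embed_positions.weight",
    "model.decoder.embed_tokens.weight",
]
kept = [k for k in quant_keys if not k.startswith("lm_head.")]
print(kept)  # only the unquantized embedding/position weights survive
```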
I am new to all the layers/modules, so forgive me if my question itself is based on false premises. Thank you! I hope to have intel/auto-round model support merged into vLLM soon.
Here are the original weights before quantization:
https://huggingface.co/facebook/opt-125m
key model.decoder.embed_tokens.weight torch.Size([50272, 768])
key lm_head.weight torch.Size([50272, 768])
So in the original OPT-125M model weights, model.decoder.embed_tokens.weight and lm_head.weight both exist, and the sizes and even the values of the tensors are exactly the same!
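For reference, this is roughly how to reproduce that observation (a sketch; the checkpoint file name pytorch_model.bin is an assumption about how the repo stores its weights):

```python
# Sketch: compare the two tensors directly in the original
# facebook/opt-125m checkpoint (file name assumed to be pytorch_model.bin).
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("facebook/opt-125m", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")

embed = state_dict["model.decoder.embed_tokens.weight"]
head = state_dict["lm_head.weight"]
print(embed.shape, head.shape)   # both torch.Size([50272, 768])
print(torch.equal(embed, head))  # expected: True -- exact duplicates
```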
@robertgshaw2-neuralmagic Is this a bug in vLLM's OPT model code? Why is it skipping the lm_head layer when it actually exists (even though it is a duplicate of embed_tokens)?
@wenhuach21 @WeiweiZhang1