
`lm_head` is quantized in the `hf_llm` example.

mzbac opened this issue on Jan 07 '24 • 8 comments

When I was playing around with the qlora example using mlx-community's 4-bit models, I encountered an issue while trying to merge the LoRA weights back into the original model. I had to dequantize the lm_head because the merged model would be a non-quantized model. However, I found that dequantizing the lm_head caused a significant performance degradation for the model.

I'm wondering if we should just disable quantization in hf_llm so that later, in lora, we can directly use the model converted by hf_llm without running into this kind of issue?

Edit: I checked the Transformers bnb integration; it seems they also skip the lm_head by default -> https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/bitsandbytes.py#L234
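
For context, the dequantization step looks roughly like this. This is a minimal sketch, not the actual example code: `dequantize_lm_head` is a hypothetical helper, it assumes the lm_head has no bias, and it assumes the layer exposes its `scales`, `biases`, `group_size`, and `bits` so `mx.dequantize` can unpack the 4-bit weights:

```python
import mlx.core as mx
import mlx.nn as nn


def dequantize_lm_head(q_layer: nn.QuantizedLinear) -> nn.Linear:
    # Unpack the packed 4-bit weights back to a full-precision matrix.
    weight = mx.dequantize(
        q_layer.weight,
        q_layer.scales,
        q_layer.biases,
        q_layer.group_size,
        q_layer.bits,
    ).astype(mx.float16)
    out_dims, in_dims = weight.shape
    layer = nn.Linear(in_dims, out_dims, bias=False)
    layer.weight = weight  # replace the randomly initialized weight
    return layer
```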

mzbac (Jan 07 '24 13:01)

> I'm wondering if we should just disable quantization in hf_llm so that later, in lora, we can directly use the model converted by hf_llm without running into this kind of issue?

I'm wondering the opposite: maybe we should keep the LM head quantized and see if QLoRA works that way. I actually meant to remove that once we had a quantized transposed matmul.

awni (Jan 07 '24 14:01)

QLoRA works with the quantized lm_head, but the issue is merging it back into the original model. The model's performance is a lot worse after dequantizing the lm_head.

mzbac (Jan 07 '24 14:01)

> QLoRA works with the quantized lm_head, but the issue is merging it back into the original model. The model's performance is a lot worse after dequantizing the lm_head.

You mean the accuracy of the model is worse?

Can you give me steps to reproduce what you observe / more details? As far as I understand, you did the following:

  1. QLoRA with a quantized LM head as well (this gives a specific loss on the test set)
  2. Dequantize the whole model and merge adapters
  3. Observe degradation on the test set loss?

awni (Jan 07 '24 14:01)

> > QLoRA works with the quantized lm_head, but the issue is merging it back into the original model. The model's performance is a lot worse after dequantizing the lm_head.
>
> You mean the accuracy of the model is worse?
>
> Can you give me steps to reproduce what you observe / more details? As far as I understand, you did the following:
>
>   1. QLoRA with a quantized LM head as well (this gives a specific loss on the test set)
>   2. Dequantize the whole model and merge adapters
>   3. Observe degradation on the test set loss?
To clarify, what I actually did was:

  1. Use QLoRA with quantized linear layers (without lm_head).
  2. Fine-tune for 600 iterations.
  3. Dequantize all linear layers and merge LoRA back.
  4. The output is worse than the original model with the adapter.

My script is here, and I am using https://huggingface.co/mlx-community/Mistral-7B-v0.1-hf-4bit-mlx as the base model.

I also tried making another quantized model without the lm_head being quantized. Redoing the steps above, the merged model is similar to the base model with the adapter (slightly worse, I guess due to dequantizing the linear layers during the merge).
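
For reference, a minimal sketch of the dequantize-and-merge in step 3, assuming a LoRALinear-style wrapper with `lora_a` of shape (in_dims, rank), `lora_b` of shape (rank, out_dims), and a `scale` factor; the names follow the lora example loosely and the shapes are assumptions:

```python
import mlx.core as mx


def merge_lora_weight(q_weight, scales, biases, group_size, bits,
                      lora_a, lora_b, scale):
    # Recover the fp16 base weight from the packed 4-bit weight.
    weight = mx.dequantize(q_weight, scales, biases, group_size, bits)
    weight = weight.astype(mx.float16)
    # Fold the low-rank update into the base weight. nn.Linear stores its
    # weight as (out_dims, in_dims), so transpose the (in_dims, out_dims) update.
    update = (lora_a @ lora_b).T
    return weight + scale * update.astype(mx.float16)
```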

mzbac (Jan 07 '24 14:01)

> > QLoRA works with the quantized lm_head, but the issue is merging it back into the original model. The model's performance is a lot worse after dequantizing the lm_head.
>
> You mean the accuracy of the model is worse?
>
> Can you give me steps to reproduce what you observe / more details? As far as I understand, you did the following:
>
>   1. QLoRA with a quantized LM head as well (this gives a specific loss on the test set)
>   2. Dequantize the whole model and merge adapters
>   3. Observe degradation on the test set loss?

Sorry, when I mentioned QLoRA with a quantized lm_head, I meant that the base model includes a quantized lm_head layer along with the other linear layers (which get LoRA). The current lora example throws an error when I try to convert lm_head to LoRALinear.

mzbac (Jan 07 '24 14:01)

Ok I think it's clear now, but just to be sure:

  1. Download https://huggingface.co/mlx-community/Mistral-7B-v0.1-hf-4bit-mlx
  2. Update the LoRA script to be able to load the quantized lm_head (as you've done here)
  3. Fine-tune
  4. Dequantize (all linear layers), merge the LoRA layers, and observe a performance regression?

I'm wondering if the performance regression has anything to do with LoRA or if it's more about the quantization + dequantization?

Also, just curious: why did you de-quantize the LM head? Are you trying to produce a full fp16/higher-precision model?

awni (Jan 07 '24 15:01)

Yeah, I don't think the performance issue is caused by LoRA. Personally, I feel it's more that if we want to merge LoRA back into the base model afterwards, the de-quantization causes some issues.

The reason I need to de-quantize the lm_head is that when I merge LoRA back into the original linear layers, those layers need to be de-quantized (to be able to merge the weights). So after the merge, the model has all fp16 linear layers except for lm_head. Based on the current configuration (model config), we cannot have a mix of quantized and non-quantized linear layers in the model. Therefore, I have to de-quantize the lm_head as well and remove the quantization configuration from config.json.
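
Concretely, the config cleanup after the merge looks something like the following sketch. The `"quantization"` key name and the `merged_model` output path are assumptions based on the mlx-community model configs:

```python
import json

config_path = "merged_model/config.json"  # hypothetical output directory
with open(config_path) as f:
    config = json.load(f)

# Drop the group_size/bits info so the loader treats every Linear as fp16.
config.pop("quantization", None)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```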

mzbac (Jan 07 '24 15:01)

> Ok I think it's clear now, but just to be sure:
>
>   1. Download https://huggingface.co/mlx-community/Mistral-7B-v0.1-hf-4bit-mlx
>   2. Update the LoRA script to be able to load the quantized lm_head (as you've done here)
>   3. Fine-tune
>   4. Dequantize (all linear layers), merge the LoRA layers, and observe a performance regression?
>
> I'm wondering if the performance regression has anything to do with LoRA or if it's more about the quantization + dequantization?
>
> Also, just curious: why did you de-quantize the LM head? Are you trying to produce a full fp16/higher-precision model?

I skip applying LoRA to lm_head here. Then I load the model with all linear layers converted to LoRALinear, except for lm_head, which stays QuantizedLinear.
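
Something along these lines (a sketch, not the exact script: the `LoRALinear.from_linear` helper, the import path, and the module names are assumptions for a Mistral-style model in the hf_llm layout):

```python
from models import LoRALinear  # hypothetical import, mirrors the lora example


def apply_lora(model, lora_layers: int = 16):
    # Wrap the attention projections of the last N transformer blocks in LoRA.
    for block in model.model.layers[-lora_layers:]:
        block.self_attn.q_proj = LoRALinear.from_linear(block.self_attn.q_proj)
        block.self_attn.v_proj = LoRALinear.from_linear(block.self_attn.v_proj)
    # model.lm_head is intentionally left untouched (still QuantizedLinear).
    return model
```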

mzbac (Jan 07 '24 15:01)

I think this is closed as of #250, since the LM head is now quantized by default.

I found in #252 that if you do QLoRA, fusing and keeping the model quantized works well, provided you use a large scale for LoRA. So I changed the default scale.

Some follow-on work needs to be done there:

  1. Make the scale configurable.
  2. Add more config params (LoRA layers, rank, etc.).
  3. Update the README to discuss how to tune LoRA.
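
For anyone tuning this, here is roughly where the scale enters the forward pass. This is a sketch with illustrative defaults, not the exact class from the repo; the parameter names follow the lora example loosely and should be treated as assumptions:

```python
import math

import mlx.core as mx
import mlx.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Module, input_dims: int, output_dims: int,
                 rank: int = 8, scale: float = 20.0):
        super().__init__()
        self.linear = base_linear  # frozen base layer (fp16 or quantized)
        self.scale = scale         # the knob changed in #252; the default here is illustrative
        bound = 1.0 / math.sqrt(input_dims)
        self.lora_a = mx.random.uniform(low=-bound, high=bound, shape=(input_dims, rank))
        self.lora_b = mx.zeros((rank, output_dims))

    def __call__(self, x):
        # Base projection plus the scaled low-rank update; a larger scale
        # gives the adapter more influence relative to the quantized base layer.
        return self.linear(x) + self.scale * ((x @ self.lora_a) @ self.lora_b)
```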

awni (Jan 08 '24 21:01)