Why is lm_head in modules_to_save? Why not "norm"?
It makes sense that "embed_tokens" should be specified in "modules_to_save" since that is not a linear layer.
But lm_head is a linear layer, so why not allow LoRA to be applied there?
Also, why not allow "norm" to be made trainable by adding to "modules_to_save"?
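For context, a minimal sketch of the distinction being asked about, using the standard PEFT `LoraConfig` (module names assume a Llama-style architecture; values are placeholders):

```python
from peft import LoraConfig

# target_modules: linear layers that get low-rank LoRA adapters.
# modules_to_save: layers that are copied and trained in full, at full rank.
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # embed_tokens is an nn.Embedding, so it goes in modules_to_save;
    # the question is why lm_head (a plain nn.Linear) is treated the same way.
    modules_to_save=["embed_tokens", "lm_head"],
)
```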
Sadly, making norm trainable would need gradients for the layernorms, which are horrifying to write up in Triton
Thanks @danielhanchen , noted on the norms.
And why not allow LoRA to be applied to lm_head?
@RonanKMcGovern Oh it can be done! It's not a normal thing to do, but it can be enabled - hmmm
Makes sense. Yeah, I don't have strong evidence that it's needed, but I recall reading about making both embed_tokens AND the norms trainable for best performance in chat fine-tuning (when setting/changing the chat template).
Oh if norms and embed_tokens and everything is enabled, that's literally full finetuning, except the weight updates are low rank :))
The layernorm gradients are just way too tedious to derive, sadly
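To make the "it can be enabled" point concrete, a sketch assuming the plain PEFT API (not necessarily what unsloth's patched path supports out of the box):

```python
from peft import LoraConfig

# Applying LoRA to lm_head instead of fully training it: list it in
# target_modules so it gets a low-rank adapter like any other nn.Linear.
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    # No modules_to_save entry for lm_head, so no full-rank copy is trained.
)
```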
Hi @danielhanchen,
I'm in the process of moving my LLM finetuning to unsloth. I'm impressed with the speed it gives, but I struggle to get the same results as before. Inspecting adapter_config.json, I noticed that "lm_head", which I had in "target_modules", is moved to "modules_to_save" in unsloth. Why is that?
I also noticed that with this change the model overfits to the training data more quickly than before.
If you turn on training for lm_head, then it might overfit, which is normal. I normally suggest just leaving it out
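For reference, a minimal sketch of a setup with lm_head (and embed_tokens) left out entirely; the model name and parameter values below are illustrative placeholders in the usual README-style call:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative choice
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA on the attention and MLP projections only; lm_head, embed_tokens and
# the norms stay frozen, which helps avoid the overfitting described above.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
```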
Hopefully it's all solved now? By the way we have new docs! https://docs.unsloth.ai/