
Loss and Grad Norm discrepancy during full finetuning

thedarkzeno opened this issue on Mar 17, 2024 · 6 comments

I've been using Unsloth for full finetuning on my models, and it has greatly improved the speed of my experiments. However, I decided to test training from scratch and compare the loss curves with and without Unsloth.

The test uses the Mistral architecture (a very small version, just for testing).
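A minimal sketch of that kind of from-scratch setup, assuming the transformers `MistralConfig` / `MistralForCausalLM` classes; the config sizes below are illustrative placeholders, not the exact values from this run:

```python
# Minimal sketch: a tiny, randomly initialized Mistral for a from-scratch training test.
# The sizes below are placeholders, not the configuration actually used in this issue.
from transformers import MistralConfig, MistralForCausalLM

config = MistralConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=2,
)
model = MistralForCausalLM(config)  # no pretrained weights, training starts from scratch
```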

Without Unsloth:

{'loss': 10.5604, 'grad_norm': 2.453125, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.2933, 'grad_norm': 2.34375, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 9.7562, 'grad_norm': 2.90625, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 9.3644, 'grad_norm': 2.046875, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 9.1276, 'grad_norm': 3.484375, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 8.9961, 'grad_norm': 1.859375, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 8.6878, 'grad_norm': 2.3125, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 8.8456, 'grad_norm': 1.8515625, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 8.7576, 'grad_norm': 1.609375, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 7.9193, 'grad_norm': 2.9375, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}

With Unsloth:

{'loss': 10.5604, 'grad_norm': 0.58984375, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.5623, 'grad_norm': 0.453125, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 10.549, 'grad_norm': 0.482421875, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 10.5392, 'grad_norm': 0.4609375, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 10.5461, 'grad_norm': 1.1015625, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 10.5112, 'grad_norm': 0.478515625, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 10.4895, 'grad_norm': 0.65234375, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 10.495, 'grad_norm': 0.46484375, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 10.4738, 'grad_norm': 0.5078125, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 10.4712, 'grad_norm': 1.3828125, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}
{'loss': 10.4655, 'grad_norm': 0.498046875, 'learning_rate': 9.999999701444471e-05, 'epoch': 0.0}
{'loss': 10.4464, 'grad_norm': 0.50390625, 'learning_rate': 9.999999644694247e-05, 'epoch': 0.0}
{'loss': 10.4683, 'grad_norm': 0.439453125, 'learning_rate': 9.99999958300922e-05, 'epoch': 0.0}

What could be the cause of this difference, and how can I fix it?

thedarkzeno · Mar 17, 2024

@thedarkzeno Oh wait, full finetuning - did you make all the layers (Q, K, V, O, gate, up, down) plus the layernorms, lm_head, and embeddings trainable?

I was gonna say I don't think Unsloth technically works for full finetuning specifically - the layernorms, for example, won't update correctly, and the rest might not be correct!!

I only verified QLoRA and LoRA losses, and they match perfectly - I'm not certain on full finetuning sadly
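One way to check is to dump which parameter groups are trainable versus frozen. A minimal sketch, assuming `model` is the Hugging Face causal LM being trained and its parameters follow the usual naming (q_proj, ..., embed_tokens, lm_head):

```python
# Minimal sketch: count trainable vs. frozen parameters per group.
# Assumes `model` is the HF causal LM being trained (standard parameter naming).
from collections import defaultdict

GROUPS = ("q_proj", "k_proj", "v_proj", "o_proj",
          "gate_proj", "up_proj", "down_proj",
          "norm", "embed_tokens", "lm_head")

counts = defaultdict(lambda: {"trainable": 0, "frozen": 0})
for name, param in model.named_parameters():
    for key in GROUPS:
        if key in name:
            state = "trainable" if param.requires_grad else "frozen"
            counts[key][state] += param.numel()
            break

for key in GROUPS:
    c = counts[key]
    print(f"{key:>12}: trainable={c['trainable']:,}  frozen={c['frozen']:,}")
```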

danielhanchen · Mar 17, 2024

Oh, I think I found the problem: while most of the parameters were trainable, the embed_tokens were not. Now it does converge faster, thanks.
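For anyone hitting the same thing, a minimal sketch of that kind of fix, assuming `model` exposes the standard Hugging Face embedding accessors:

```python
# Minimal sketch: make the input embeddings and output head trainable too.
# Assumes `model` is a Hugging Face causal LM (e.g. MistralForCausalLM).
for module in (model.get_input_embeddings(), model.get_output_embeddings()):
    if module is not None:
        for param in module.parameters():
            param.requires_grad = True

# Sanity check before launching a full-finetuning run: nothing should be frozen.
assert all(p.requires_grad for p in model.parameters()), "some parameters are still frozen"
```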

thedarkzeno · Mar 17, 2024

@thedarkzeno Oh I just added a fix for embed_tokens and lm_head :) You might have to update Unsloth :)

danielhanchen · Mar 17, 2024

@thedarkzeno On that note - do you know if the losses align now? :)

danielhanchen · Mar 18, 2024

[image: loss curve comparison] I ran a test setting requires_grad for the embed_tokens and lm_head, and the result was this (green line is with Unsloth). They don't exactly match, but they got closer.
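For reference, a comparison plot like that can be produced from the two runs' logged losses. A minimal sketch, assuming `trainer_baseline` and `trainer_unsloth` are the two finished Hugging Face `Trainer` objects:

```python
# Minimal sketch: overlay the training-loss curves of two finished HF Trainer runs.
# Assumes `trainer_baseline` (without Unsloth) and `trainer_unsloth` (with Unsloth) exist.
import matplotlib.pyplot as plt

def loss_curve(trainer):
    logs = [log for log in trainer.state.log_history if "loss" in log]
    return [log["step"] for log in logs], [log["loss"] for log in logs]

for trainer, label in ((trainer_baseline, "without unsloth"), (trainer_unsloth, "with unsloth")):
    steps, losses = loss_curve(trainer)
    plt.plot(steps, losses, label=label)

plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.show()
```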

thedarkzeno · Mar 18, 2024

@thedarkzeno I'm assuming it's the layernorms - we don't actually support full finetuning (FFT) since the layernorm gradients are more involved to calculate, hence the difference
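For context on why those gradients are trickier: Mistral's norm layers are RMSNorms, and the input gradient couples every element of the hidden vector through the shared RMS statistic, so a fused kernel has to handle the cross term correctly. A sketch of the backward pass for $y = \gamma \odot x / r$ with $r = \sqrt{\tfrac{1}{d}\sum_k x_k^2 + \varepsilon}$ and upstream gradient $g = \partial\mathcal{L}/\partial y$:

$$
\frac{\partial \mathcal{L}}{\partial \gamma_i} = \frac{g_i\, x_i}{r},
\qquad
\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\gamma_i\, g_i}{r} - \frac{x_i}{d\, r^{3}} \sum_{j=1}^{d} g_j\, \gamma_j\, x_j .
$$

The second term is the cross-element coupling; an error there, or in how it is accumulated, would show up as the kind of grad-norm discrepancy seen in the logs above.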

danielhanchen · Mar 19, 2024