Loss and Grad Norm discrepancy during full finetuning
I've been using Unsloth for full finetuning on my models, and it has greatly improved the speed of my experiments. However, I decided to test training from scratch and compared the loss curves with and without Unsloth.
The test uses the Mistral architecture (a very small version, just for testing).
Without Unsloth:
{'loss': 10.5604, 'grad_norm': 2.453125, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.2933, 'grad_norm': 2.34375, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 9.7562, 'grad_norm': 2.90625, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 9.3644, 'grad_norm': 2.046875, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 9.1276, 'grad_norm': 3.484375, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 8.9961, 'grad_norm': 1.859375, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 8.6878, 'grad_norm': 2.3125, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 8.8456, 'grad_norm': 1.8515625, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 8.7576, 'grad_norm': 1.609375, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 7.9193, 'grad_norm': 2.9375, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}
With Unsloth:
{'loss': 10.5604, 'grad_norm': 0.58984375, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.5623, 'grad_norm': 0.453125, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 10.549, 'grad_norm': 0.482421875, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 10.5392, 'grad_norm': 0.4609375, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 10.5461, 'grad_norm': 1.1015625, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 10.5112, 'grad_norm': 0.478515625, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 10.4895, 'grad_norm': 0.65234375, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 10.495, 'grad_norm': 0.46484375, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 10.4738, 'grad_norm': 0.5078125, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 10.4712, 'grad_norm': 1.3828125, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}
{'loss': 10.4655, 'grad_norm': 0.498046875, 'learning_rate': 9.999999701444471e-05, 'epoch': 0.0}
{'loss': 10.4464, 'grad_norm': 0.50390625, 'learning_rate': 9.999999644694247e-05, 'epoch': 0.0}
{'loss': 10.4683, 'grad_norm': 0.439453125, 'learning_rate': 9.99999958300922e-05, 'epoch': 0.0}
What could be the cause of this difference, and how can I fix it?
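For context, the kind of from-scratch baseline described above can be set up roughly as follows (a minimal sketch only; the model sizes and training arguments here are illustrative, not the exact configuration used in this test):

```python
# Tiny Mistral trained from scratch with the plain Hugging Face Trainer (the "without Unsloth" run).
# Sizes below are illustrative placeholders for a "very small version just for testing".
from transformers import MistralConfig, MistralForCausalLM, Trainer, TrainingArguments

config = MistralConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
)
model = MistralForCausalLM(config)  # random init, i.e. training from scratch

args = TrainingArguments(
    output_dir="tiny-mistral-test",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    logging_steps=1,
    max_steps=1000,
    report_to="none",
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
# The Unsloth run would use the same data and arguments, with the model prepared through Unsloth instead.
```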
@thedarkzeno Oh wait, full finetuning - did you make all the layers (Q, K, V, O, gate, up, down) plus the layernorms, lm_head, and embeddings trainable?
I was going to say I don't think Unsloth technically works for full finetuning specifically - the layernorms, for example, won't update correctly, and the rest might not be correct!!
I only verified the QLoRA and LoRA losses, and they match perfectly - I'm not certain about full finetuning, sadly.
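A quick way to check which parameter groups are actually trainable (a plain PyTorch sketch; `model` is whichever model object gets handed to the trainer):

```python
# Print any frozen parameters so missing groups (embed_tokens, lm_head, layernorms, ...) stand out.
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
if frozen:
    print("Frozen parameters:")
    for name in frozen:
        print("   ", name)
else:
    print("All parameters are trainable.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,}")
```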
Oh, I think I found the problem: while most of the parameters were trainable, the embed_tokens were not. Now it does converge faster, thanks.
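For reference, re-enabling gradients on the embeddings and the output head looks roughly like this for a standard Hugging Face causal LM (a sketch, not the exact code used here):

```python
# Make the input embeddings and the LM head trainable again.
model.get_input_embeddings().weight.requires_grad_(True)   # embed_tokens
model.get_output_embeddings().weight.requires_grad_(True)  # lm_head (same tensor as embed_tokens if weights are tied)
```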
@thedarkzeno Oh I just added a fix for embed_tokens and lm_head :) You might have to update Unsloth :)
@thedarkzeno On that note - do you know if the losses align now? :)
I ran a test setting requires_grad for the embed_tokens and lm_head, and the result was this... (the green line is with Unsloth)
They don't match exactly, but they got closer.
@thedarkzeno I'm assuming it's the layernorms - we don't actually support FFT (full finetuning) since the layernorms' gradients are more involved to calculate, hence the difference.
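For intuition on why the layernorms are the awkward part: Mistral's layernorms are RMSNorm layers, and their backward pass couples every hidden dimension through the normalization statistic. A standard derivation sketch (not Unsloth's kernel code):

```latex
% RMSNorm for one token x \in \mathbb{R}^d with learnable weight w:
%   r = \sqrt{\tfrac{1}{d}\sum_j x_j^2 + \epsilon}, \quad \hat{x} = x / r, \quad y = w \odot \hat{x}
% With upstream gradient g = \partial L / \partial y:
\frac{\partial L}{\partial w_i} = \sum_{\text{tokens}} g_i \, \hat{x}_i
\qquad
\frac{\partial L}{\partial x_i} = \frac{w_i g_i}{r} - \frac{x_i}{d \, r^{3}} \sum_j g_j \, w_j \, x_j
```

In LoRA/QLoRA the layernorm weights stay frozen, so only the input gradient matters; full finetuning also needs the weight gradient accumulated correctly across every token, which is presumably the extra work being referred to here.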