
Loss and Grad Norm discrepancy during full finetuning

thedarkzeno opened this issue on Mar 17, 2024 · 6 comments

I've been using Unsloth for full finetuning on my models, and it has greatly improved the speed of my experiments. However, I decided to test training from scratch and compare the loss curves with and without Unsloth.

The test uses the Mistral architecture (a very small version, just for testing).
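A minimal sketch of that kind of from-scratch setup, assuming the transformers `MistralConfig` / `MistralForCausalLM` classes; the config sizes below are illustrative placeholders, not the exact values from this run:

```python
# Minimal sketch: a tiny, randomly initialized Mistral for a from-scratch training test.
# The sizes below are placeholders, not the configuration actually used in this issue.
from transformers import MistralConfig, MistralForCausalLM

config = MistralConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=2,
)
model = MistralForCausalLM(config)  # no pretrained weights, training starts from scratch
```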

Without Unsloth:

{'loss': 10.5604, 'grad_norm': 2.453125, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.2933, 'grad_norm': 2.34375, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 9.7562, 'grad_norm': 2.90625, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 9.3644, 'grad_norm': 2.046875, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 9.1276, 'grad_norm': 3.484375, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 8.9961, 'grad_norm': 1.859375, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 8.6878, 'grad_norm': 2.3125, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 8.8456, 'grad_norm': 1.8515625, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 8.7576, 'grad_norm': 1.609375, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 7.9193, 'grad_norm': 2.9375, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}

With Unsloth:

{'loss': 10.5604, 'grad_norm': 0.58984375, 'learning_rate': 9.999999997532599e-05, 'epoch': 0.0}
{'loss': 10.5623, 'grad_norm': 0.453125, 'learning_rate': 9.999999990130396e-05, 'epoch': 0.0}
{'loss': 10.549, 'grad_norm': 0.482421875, 'learning_rate': 9.99999997779339e-05, 'epoch': 0.0}
{'loss': 10.5392, 'grad_norm': 0.4609375, 'learning_rate': 9.999999960521582e-05, 'epoch': 0.0}
{'loss': 10.5461, 'grad_norm': 1.1015625, 'learning_rate': 9.999999938314972e-05, 'epoch': 0.0}
{'loss': 10.5112, 'grad_norm': 0.478515625, 'learning_rate': 9.999999911173561e-05, 'epoch': 0.0}
{'loss': 10.4895, 'grad_norm': 0.65234375, 'learning_rate': 9.999999879097347e-05, 'epoch': 0.0}
{'loss': 10.495, 'grad_norm': 0.46484375, 'learning_rate': 9.999999842086332e-05, 'epoch': 0.0}
{'loss': 10.4738, 'grad_norm': 0.5078125, 'learning_rate': 9.999999800140513e-05, 'epoch': 0.0}
{'loss': 10.4712, 'grad_norm': 1.3828125, 'learning_rate': 9.999999753259892e-05, 'epoch': 0.0}
{'loss': 10.4655, 'grad_norm': 0.498046875, 'learning_rate': 9.999999701444471e-05, 'epoch': 0.0}
{'loss': 10.4464, 'grad_norm': 0.50390625, 'learning_rate': 9.999999644694247e-05, 'epoch': 0.0}
{'loss': 10.4683, 'grad_norm': 0.439453125, 'learning_rate': 9.99999958300922e-05, 'epoch': 0.0}

What could be the cause of this difference, and how can I fix it?

thedarkzeno · Mar 17, 2024

@thedarkzeno Oh wait, full finetuning - did you make all the layers (Q, K, V, O, gate, up, down) plus the layernorms, lm_head, and embeddings trainable?

I was gonna say I don't think Unsloth technically works for full finetuning specifically - the layernorms, for example, won't update correctly, and the rest might not be correct!!

I only verified QLoRA and LoRA losses, and they match perfectly - I'm not certain on full finetuning sadly
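One way to check is to dump which parameter groups are trainable versus frozen. A minimal sketch, assuming `model` is the Hugging Face causal LM being trained and its parameters follow the usual naming (q_proj, ..., embed_tokens, lm_head):

```python
# Minimal sketch: count trainable vs. frozen parameters per group.
# Assumes `model` is the HF causal LM being trained (standard parameter naming).
from collections import defaultdict

GROUPS = ("q_proj", "k_proj", "v_proj", "o_proj",
          "gate_proj", "up_proj", "down_proj",
          "norm", "embed_tokens", "lm_head")

counts = defaultdict(lambda: {"trainable": 0, "frozen": 0})
for name, param in model.named_parameters():
    for key in GROUPS:
        if key in name:
            state = "trainable" if param.requires_grad else "frozen"
            counts[key][state] += param.numel()
            break

for key in GROUPS:
    c = counts[key]
    print(f"{key:>12}: trainable={c['trainable']:,}  frozen={c['frozen']:,}")
```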

danielhanchen · Mar 17, 2024

Oh, I think I found the problem: while most of the parameters were trainable, the embed_tokens were not. Now it does converge faster, thanks.
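For anyone hitting the same thing, a minimal sketch of that kind of fix, assuming `model` exposes the standard Hugging Face embedding accessors:

```python
# Minimal sketch: make the input embeddings and output head trainable too.
# Assumes `model` is a Hugging Face causal LM (e.g. MistralForCausalLM).
for module in (model.get_input_embeddings(), model.get_output_embeddings()):
    if module is not None:
        for param in module.parameters():
            param.requires_grad = True

# Sanity check before launching a full-finetuning run: nothing should be frozen.
assert all(p.requires_grad for p in model.parameters()), "some parameters are still frozen"
```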

thedarkzeno · Mar 17, 2024

@thedarkzeno Oh I just added a fix for embed_tokens and lm_head :) You might have to update Unsloth :)

danielhanchen · Mar 17, 2024

@thedarkzeno On that note - do you know if the losses align now? :)

danielhanchen · Mar 18, 2024

[image: loss curve comparison] I ran a test setting requires_grad for the embed_tokens and lm_head, and the result was this (green line is with Unsloth). They don't exactly match, but they got closer.
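For reference, a comparison plot like that can be produced from the two runs' logged losses. A minimal sketch, assuming `trainer_baseline` and `trainer_unsloth` are the two finished Hugging Face `Trainer` objects:

```python
# Minimal sketch: overlay the training-loss curves of two finished HF Trainer runs.
# Assumes `trainer_baseline` (without Unsloth) and `trainer_unsloth` (with Unsloth) exist.
import matplotlib.pyplot as plt

def loss_curve(trainer):
    logs = [log for log in trainer.state.log_history if "loss" in log]
    return [log["step"] for log in logs], [log["loss"] for log in logs]

for trainer, label in ((trainer_baseline, "without unsloth"), (trainer_unsloth, "with unsloth")):
    steps, losses = loss_curve(trainer)
    plt.plot(steps, losses, label=label)

plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.show()
```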

thedarkzeno · Mar 18, 2024

@thedarkzeno I'm assuming it's the layernorms - we don't actually support full finetuning (FFT) since the layernorm gradients are more involved to calculate, hence the difference
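For context on why those gradients are trickier: Mistral's norm layers are RMSNorms, and the input gradient couples every element of the hidden vector through the shared RMS statistic, so a fused kernel has to handle the cross term correctly. A sketch of the backward pass for $y = \gamma \odot x / r$ with $r = \sqrt{\tfrac{1}{d}\sum_k x_k^2 + \varepsilon}$ and upstream gradient $g = \partial\mathcal{L}/\partial y$:

$$
\frac{\partial \mathcal{L}}{\partial \gamma_i} = \frac{g_i\, x_i}{r},
\qquad
\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\gamma_i\, g_i}{r} - \frac{x_i}{d\, r^{3}} \sum_{j=1}^{d} g_j\, \gamma_j\, x_j .
$$

The second term is the cross-element coupling; an error there, or in how it is accumulated, would show up as the kind of grad-norm discrepancy seen in the logs above.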

danielhanchen · Mar 19, 2024