qlora
Whenever I use QLoRA to train LLaMA/Llama 2 on an instruction-tuning dataset like Dolly or Alpaca, I get a periodically oscillating training loss.
Is this behavior normal/acceptable? Why does it happen?
I have similar sawtooth shape loss on Alpaca data, excerpt of my training output log is here:
{'loss': 1.5872, 'learning_rate': 1e-06, 'epoch': 0.01}
{'loss': 1.237, 'learning_rate': 1e-06, 'epoch': 0.02}
{'loss': 1.4684, 'learning_rate': 1e-06, 'epoch': 0.04}
{'loss': 2.1779, 'learning_rate': 1e-06, 'epoch': 0.05}
{'loss': 3.357, 'learning_rate': 1e-06, 'epoch': 0.06}
{'loss': 1.5047, 'learning_rate': 1e-06, 'epoch': 0.07}
{'loss': 1.2749, 'learning_rate': 1e-06, 'epoch': 0.08}
{'loss': 1.477, 'learning_rate': 1e-06, 'epoch': 0.1}
{'loss': 2.1822, 'learning_rate': 1e-06, 'epoch': 0.11}
{'loss': 3.2731, 'learning_rate': 1e-06, 'epoch': 0.12}
{'loss': 1.5442, 'learning_rate': 1e-06, 'epoch': 0.13}
{'loss': 1.2816, 'learning_rate': 1e-06, 'epoch': 0.14}
{'loss': 1.4423, 'learning_rate': 1e-06, 'epoch': 0.16}
{'loss': 2.1455, 'learning_rate': 1e-06, 'epoch': 0.17}
{'loss': 3.2909, 'learning_rate': 1e-06, 'epoch': 0.18}
{'loss': 1.6531, 'learning_rate': 1e-06, 'epoch': 0.19}
{'loss': 1.2675, 'learning_rate': 1e-06, 'epoch': 0.2}
Does that mean the loss is trending the wrong way as far as fine-tuning is concerned? And which loss should be the key indicator when fine-tuning on Alpaca?
Thanks!
BTW, @ritabratamaiti, how did you get the above plot?
This might be due to the "group by length" option, try disabling it.
--group_by_length [GROUP_BY_LENGTH]
Group sequences into batches with same length. Saves memory and speeds up training considerably. (default: True)
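Assuming the flag above comes from the training script's HfArgumentParser-based CLI (the script name and flag handling here are illustrative, not confirmed from the repo), disabling it could look like:

```shell
# Illustrative only: how booleans are parsed depends on the script's
# argument parser; some CLIs instead expect --group_by_length false
# or a dedicated --no_group_by_length flag.
python qlora.py --group_by_length False
```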
@BugReporterZ Could you explain the reasoning for why group_by_length may be causing this issue?
It appears to group training examples into length-ordered chunks, and the longer training examples at the start of each chunk show a higher loss. I also recall reading elsewhere that it can cause an "oscillating" training loss curve, which is consistent with what you're seeing; it may have been this comment by artidoro:
https://github.com/artidoro/qlora/issues/84#issuecomment-1572408347
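The mechanism can be illustrated with a toy simulation (this is NOT the Trainer's actual sampler code; the window size and the assumption that loss grows with sequence length are simplifications):

```python
import random

# Toy illustration: group_by_length-style sampling sorts examples by length
# within windows spanning several batches. If per-example loss tends to grow
# with sequence length, the per-batch mean loss decreases within each window
# and jumps back up at the start of the next, producing a sawtooth curve.

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(320)]  # hypothetical dataset

batch_size = 8
window = batch_size * 5  # sort within windows spanning 5 batches (assumption)

ordered = []
for i in range(0, len(lengths), window):
    # Longest examples first within each window, per the explanation above.
    ordered.extend(sorted(lengths[i:i + window], reverse=True))

# Stand-in loss model: per-batch "loss" proportional to mean sequence length.
batch_loss = [
    sum(ordered[i:i + batch_size]) / batch_size
    for i in range(0, len(ordered), batch_size)
]

# Within every 5-batch window the mean loss is non-increasing, then resets:
# the sawtooth shape seen in the logs.
for w in range(0, len(batch_loss), 5):
    chunk = batch_loss[w:w + 5]
    assert chunk == sorted(chunk, reverse=True)
```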
@bqcao This is from weights and biases (wandb); I set it up to visualize training.
@BugReporterZ I see! Thanks for the explanation. Has there been any work on incorporating QLoRA with SFTTrainer?
@ritabratamaiti I'm not aware of efforts in that regard, unfortunately.
@ritabratamaiti Yes, QLoRA is supported by SFTTrainer. You can use this example script and set load_in_4bit=True and use_peft=True.
https://github.com/lvwerra/trl/blob/main/examples/scripts/sft_trainer.py
See this blog for more details: https://huggingface.co/blog/4bit-transformers-bitsandbytes
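For orientation, a QLoRA-with-SFTTrainer setup might look roughly like the sketch below. This is not the linked example script: the model and dataset names are placeholders, and the exact SFTTrainer/LoraConfig arguments vary across trl/peft/transformers versions, so treat it as a configuration sketch to check against the current docs.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

# 4-bit NF4 quantization of the frozen base model: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters are the only trainable parameters.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # example dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
)
trainer.train()
```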
@BugReporterZ thanks for the explanation.
Thanks @BugReporterZ and @vincentmin
Thanks @BugReporterZ! Yes indeed, after disabling group_by_length I don't see the sawtooth shape anymore. Thanks to @ritabratamaiti and @vincentmin as well!
@BugReporterZ What is the impact on training if we disable group_by_length? Is the result comparable to having it enabled, with the only gains from enabling it being memory savings and faster training?
I haven't investigated that in detail. I have always left that enabled because the eval loss curve didn't seem to be affected. You could refer to the Transformers documentation for what it does (same as what was relayed by Artidoro in the comment I linked earlier):
https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.group_by_length
group_by_length (bool, optional, defaults to False) — Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.
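In Trainer-based code, toggling the option corresponds to a single TrainingArguments field (output_dir is a placeholder; this sketch assumes a transformers version with this argument, per the docs quoted above):

```python
from transformers import TrainingArguments

# group_by_length=False keeps the usual shuffled batch order; padding waste
# may increase, but per-batch losses are no longer sorted by length, so the
# sawtooth pattern in the logged training loss should disappear.
args = TrainingArguments(output_dir="out", group_by_length=False)
```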
Got it, thank you