
Whenever I use QLoRA to train LLama/LLama 2 on an instruction-tuning dataset like Dolly or Alpaca I get a periodically oscillating training loss

Open ritabratamaiti opened this issue 1 year ago • 12 comments

[Training loss plot: periodically oscillating (sawtooth) training loss curve]

Is this behavior normal/acceptable? Why does it happen?

ritabratamaiti avatar Jul 23 '23 10:07 ritabratamaiti

I have similar sawtooth shape loss on Alpaca data, excerpt of my training output log is here:

{'loss': 1.5872, 'learning_rate': 1e-06, 'epoch': 0.01}
{'loss': 1.237, 'learning_rate': 1e-06, 'epoch': 0.02}
{'loss': 1.4684, 'learning_rate': 1e-06, 'epoch': 0.04}
{'loss': 2.1779, 'learning_rate': 1e-06, 'epoch': 0.05}
{'loss': 3.357, 'learning_rate': 1e-06, 'epoch': 0.06}
{'loss': 1.5047, 'learning_rate': 1e-06, 'epoch': 0.07}
{'loss': 1.2749, 'learning_rate': 1e-06, 'epoch': 0.08}
{'loss': 1.477, 'learning_rate': 1e-06, 'epoch': 0.1}
{'loss': 2.1822, 'learning_rate': 1e-06, 'epoch': 0.11}
{'loss': 3.2731, 'learning_rate': 1e-06, 'epoch': 0.12}
{'loss': 1.5442, 'learning_rate': 1e-06, 'epoch': 0.13}
{'loss': 1.2816, 'learning_rate': 1e-06, 'epoch': 0.14}
{'loss': 1.4423, 'learning_rate': 1e-06, 'epoch': 0.16}
{'loss': 2.1455, 'learning_rate': 1e-06, 'epoch': 0.17}
{'loss': 3.2909, 'learning_rate': 1e-06, 'epoch': 0.18}
{'loss': 1.6531, 'learning_rate': 1e-06, 'epoch': 0.19}
{'loss': 1.2675, 'learning_rate': 1e-06, 'epoch': 0.2}

Does that mean the loss is trending the wrong way as far as fine-tuning is concerned? Or which loss should be the key indicator for fine-tuning on Alpaca?

Thanks!

BTW, @ritabratamaiti , how did you get the above plot?

bqcao avatar Jul 23 '23 11:07 bqcao

This might be due to the "group by length" option; try disabling it (see the sketch after this comment).

--group_by_length [GROUP_BY_LENGTH]
    Group sequences into batches with same length. Saves memory and speeds up training considerably. (default: True)

BugReporterZ avatar Jul 23 '23 12:07 BugReporterZ
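For reference, the --group_by_length flag appears to map onto the standard group_by_length field of transformers.TrainingArguments, so a minimal sketch of turning it off in Python (the output directory and batch size below are placeholders, not values from this thread) would be:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",          # placeholder path
    per_device_train_batch_size=4,  # placeholder batch size
    group_by_length=False,          # disable length-grouped batching
)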

@BugReporterZ Could you explain the reasoning for why group_by_length may be causing this issue?

vincentmin avatar Jul 23 '23 14:07 vincentmin

It appears to group training examples into length-ordered chunks, and the longer training examples at the start of these chunks show a higher loss (there is a toy illustration after this comment). I also recall reading elsewhere that it can cause an "oscillating" training loss curve, which is consistent with what you're seeing. Maybe it was this comment by artidoro:

https://github.com/artidoro/qlora/issues/84#issuecomment-1572408347

BugReporterZ avatar Jul 23 '23 14:07 BugReporterZ
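A toy illustration of that explanation (this is not the actual transformers sampler, just fake sequence lengths, to show why length-ordered chunks make the per-batch statistics periodic):

import random

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(64)]  # fake example lengths

chunk_size, batch_size = 16, 4
batches = []
for start in range(0, len(lengths), chunk_size):
    # Sort each chunk so the longest examples come first, roughly what grouping by length does.
    chunk = sorted(lengths[start:start + chunk_size], reverse=True)
    batches += [chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)]

# The mean sequence length per batch rises at the start of every chunk and falls
# toward its end; if longer examples tend to report a higher loss, the loss curve
# inherits the same repeating sawtooth pattern.
for i, batch in enumerate(batches):
    print(i, sum(batch) / len(batch))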

@bqcao This is from Weights & Biases (wandb); I set it up to visualize training (a minimal setup sketch follows this comment).

@BugReporterZ I see! Thanks for the explanation. Has there been any work on incorporating QLoRA with SFTTrainer?

ritabratamaiti avatar Jul 23 '23 14:07 ritabratamaiti
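For anyone else wondering about the plot: an assumed minimal wandb setup with the transformers Trainer (not necessarily the exact configuration used above) looks roughly like this:

import wandb
from transformers import TrainingArguments

wandb.login()  # or set the WANDB_API_KEY environment variable

training_args = TrainingArguments(
    output_dir="./output",  # placeholder path
    report_to="wandb",      # send loss / learning-rate curves to Weights & Biases
    logging_steps=10,       # how often metrics are logged
)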

@ritabratamaiti I'm not aware of efforts in that regard, unfortunately.

BugReporterZ avatar Jul 23 '23 14:07 BugReporterZ

@ritabratamaiti Yes, QLoRA is supported by SFTTrainer. You can use this example script and set load_in_4bit=True and use_peft=True. https://github.com/lvwerra/trl/blob/main/examples/scripts/sft_trainer.py See this blog for more details: https://huggingface.co/blog/4bit-transformers-bitsandbytes

@BugReporterZ thanks for the explanation.

vincentmin avatar Jul 23 '23 14:07 vincentmin
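A rough sketch of the SFTTrainer setup described above (the dataset, model id, and LoRA hyperparameters are illustrative, and the exact SFTTrainer arguments depend on the trl version; see the linked script and blog post for the canonical version):

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Load the base model in 4-bit, QLoRA-style.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# LoRA adapter configuration (illustrative hyperparameters).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",  # column holding the training text
    peft_config=peft_config,
    max_seq_length=512,
)
trainer.train()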

Thanks @BugReporterZ and @vincentmin

ritabratamaiti avatar Jul 23 '23 15:07 ritabratamaiti

Thanks @BugReporterZ ! Yes indeed, after disabling group_by_length I no longer see the sawtooth shape. Appreciate @ritabratamaiti and @vincentmin as well!

bqcao avatar Jul 23 '23 22:07 bqcao

@BugReporterZ What is the impact on training if we disable group_by_length? Is it comparable to having it set to true, with the only gain being memory savings?

usmanxia avatar Aug 01 '23 16:08 usmanxia

I haven't investigated that in detail. I have always left it enabled because the eval loss curve didn't seem to be affected. You could refer to the Transformers documentation for what it does (the same as what Artidoro relayed in the comment I linked earlier; a rough illustration of the padding savings follows below):

https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.group_by_length

group_by_length (bool, optional, defaults to False) — Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.

BugReporterZ avatar Aug 01 '23 16:08 BugReporterZ
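As a rough, back-of-the-envelope illustration of the efficiency claim in those docs (toy numbers, not measurements from any real dataset): with dynamic padding every sequence in a batch is padded to the batch maximum, so length-sorted ordering wastes far fewer padding tokens than random ordering.

import random

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(1024)]  # fake sequence lengths
batch_size = 8

def padding_tokens(seq_lengths):
    # With dynamic padding, each batch is padded up to its longest sequence.
    total = 0
    for i in range(0, len(seq_lengths), batch_size):
        batch = seq_lengths[i:i + batch_size]
        total += max(batch) * len(batch) - sum(batch)
    return total

print("random order:    ", padding_tokens(lengths))
print("sorted by length:", padding_tokens(sorted(lengths)))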

Got it, thank you

usmanxia avatar Aug 01 '23 18:08 usmanxia