New GA fix causes training loss multiple times higher across the board (5x to 10x higher)
System Info
8xH100
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both have the same problem.
SFT after GA fix
SFT before GA fix
Expected behavior
The training loss values should be aligned with the old values, or be lower.
I also saw the loss decrease during training, but the loss value and grad_norm became very large. According to the blog post (https://huggingface.co/blog/gradient_accumulation), the code was changed to 'sum' the loss instead of taking the 'mean', and then divide it by 'num_items'. I'm wondering whether this summed value is what is being reported. If I'm mistaken, please let me know.
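For anyone following along, here is a minimal, self-contained sketch (toy numbers, not the actual Trainer code) of the difference the blog post describes between averaging each micro-batch and summing then dividing by the total token count:

```python
import torch

# Two micro-batches in one accumulation window with unequal (non-padded) token counts.
# The per-token losses are made up purely for illustration.
micro_batch_losses = [torch.rand(12), torch.rand(3)]

# Old behaviour: mean per micro-batch, then averaged across accumulation steps.
old_loss = torch.stack([l.mean() for l in micro_batch_losses]).mean()

# New behaviour: sum every token loss, then divide once by the total number of items
# in the whole accumulation window (what the fix calls num_items_in_batch).
num_items_in_batch = sum(l.numel() for l in micro_batch_losses)
new_loss = torch.cat(micro_batch_losses).sum() / num_items_in_batch

# The two only agree when every micro-batch contains the same number of tokens.
print(old_loss.item(), new_loss.item())
```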
Can you share more info? It would be useful to know at least the training arguments and model used.
Actually, would it be possible for you both, @JianbangZ and @KeonwooChoi, to try it via Unsloth, using the same dataset if possible? Thanks!
Here is my experiment; I can confirm the fix still needs further fixing.
Will take a look since https://github.com/huggingface/transformers/pull/34198 is really what fixed it on the trainer
Just an FYI: for a true test it would be better to compare this against a run with no gradient accumulation. I wouldn't be surprised if the LR hyperparameters need to be different now, but I'm looking at this today/right now.
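For example (illustrative TrainingArguments only, not the reporter's actual configuration), a no-accumulation baseline with the same effective batch size would look like:

```python
from transformers import TrainingArguments

# Same effective batch size of 16; only the accumulation differs.
baseline_args = TrainingArguments(
    output_dir="baseline", per_device_train_batch_size=16, gradient_accumulation_steps=1
)
grad_accum_args = TrainingArguments(
    output_dir="grad_accum", per_device_train_batch_size=4, gradient_accumulation_steps=4
)
```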
@muellerzr Does transformers need to update all the loss functions in the corresponding modeling files to pass "num_items_in_batch" as an argument? For example: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1212C94-L1212C118. The fix is mentioned here (though in Chinese): https://github.com/hiyouga/LLaMA-Factory/issues/5747#issuecomment-2425488226
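For context, the modeling-side change being asked about amounts to something like this simplified sketch (hypothetical code, not the exact transformers implementation):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, vocab_size, num_items_in_batch=None, ignore_index=-100):
    # Shift so each position predicts the next token.
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1)
    # When the Trainer supplies num_items_in_batch, sum and divide by the global
    # token count; otherwise fall back to the old per-batch mean.
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index, reduction=reduction)
    if num_items_in_batch is not None:
        loss = loss / num_items_in_batch
    return loss
```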
Yeah. Will do so today
@JianbangZ you can build off of pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues. It has the rest of the models + the full fix for trainer
Do you have to pass num_items_in_batch as a Trainer argument now?
No, a user never has to keep track of that
Tried it; the loss value is still very large. It seems num_items_in_batch is not taking effect.
Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger bs and no grad accum as a baseline, if possible.
Also, to be clear: they shouldn't be the same; the values will differ from the old version because they are two entirely different calculations. Hence the request for a reproducer and a test with no grad accum.
I found out why. Your self.model_accepts_loss_kwargs was not set to True (how is it supposed to be set dynamically?). Anyway, I hardcoded it to True and the loss value seems to be back to normal. I will run a full 3-hour training (a multimodal training with Llama3.2-3B as the backend) and then report the loss curves here.
@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument
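Roughly, that check can be pictured like this (simplified sketch, not the exact Trainer code):

```python
import inspect

def accepts_loss_kwargs(model) -> bool:
    # True if forward() declares `loss_kwargs` explicitly or takes **kwargs.
    params = inspect.signature(model.forward).parameters
    return "loss_kwargs" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
```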
Yes, I have a script that uses Llama-3.2-3B-it as the LLM backend and does a LLaVA-style training. I think it's good to test an MLLM training beyond just regular LLM SFT to see how things go. And where do I set this, or how is the forward() signature determined? Shouldn't loss_kwargs always be accepted?
To my knowledge this is really only for CLM/LLM training, so the PR limits it to that (e.g. it wouldn't make sense with classification/problems where the lengths of the inputs matter w.r.t. the loss function). You can see in the details of the PR where/how the models are updated with this new mechanic.
What does model.loss_function give you?
Can confirm it seems fixed after I enabled loss_kwargs, and the loss is looking great at bs16_ga4 vs. before (1.7 vs. 1.74).
Would you mind summarizing the solution for this issue?
@muellerzr Thank you for the awesome work on your PR. This is a noob question, but:
If this wasn't OOTB in the Trainer API, were we meant to provide the compute_loss_fn argument ourselves? Could you provide a small example of what that would look like? Just the source code for your experiments would also work!
@man-shar the docs in the Trainer are fairly detailed, but there's an example test here: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L740-L834
If I had to guess, the reason I was confused earlier is that it was a PEFT method, which is why it didn't work.
Ah awesome, thanks!
pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues
Then enable loss_kwargs in forward(), or you can hardcode loss_kwargs to True in the Trainer.
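Concretely, the hardcoding workaround looks roughly like this (your own `model`, `training_args`, and `train_dataset`; `model_accepts_loss_kwargs` is the internal flag mentioned earlier, so check it still exists in your version):

```python
from transformers import Trainer

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# Force the Trainer down the num_items_in_batch path even if the model's
# forward() signature was not detected as accepting loss kwargs.
trainer.model_accepts_loss_kwargs = True
trainer.train()
```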
Hey @JianbangZ, can you please give an example? I'm not able to find a parameter called "loss_kwargs" in the Trainer class.
@thesillystudent if you build off my branch again, we now take PEFT models into account, which should fix the issue. You, as a user, do not have access to loss_kwargs in the Trainer; it's internal.
You can, however, create your own compute_loss_func.
(This will be merged shortly)
Would you give an example of how to use compute_loss_func?
@paulcx as mentioned there's an example in the test:
```python
from functools import partial
from transformers import AutoModelForCausalLM
from transformers.loss.loss_utils import ForCausalLMLoss  # location of the CLM loss helper in recent releases

model = AutoModelForCausalLM.from_pretrained(model_name)

# `logits` here is the full model output dict; ForCausalLMLoss handles the label shift and normalization
def compute_loss(logits, labels, vocab_size, num_items_in_batch, disable_num_items_in_batch=False):
    return ForCausalLMLoss(
        logits["logits"], labels, vocab_size, num_items_in_batch, disable_num_items_in_batch
    )

loss_fn = partial(compute_loss, vocab_size=model.config.vocab_size, disable_num_items_in_batch=False)
```
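The resulting callable is then passed to the Trainer via compute_loss_func (training_args and train_dataset stand in for your own setup):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_loss_func=loss_fn,  # the partial built above
)
trainer.train()
```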
I have reinstalled: pip install git+https://github.com/huggingface/transformers
Finetuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (forced to ignore the version check), with enable_liger_kernel: true.
The loss is still way too high, about 10 times higher than before, although the eval loss is as small as before!
Do you know why?
Thanks, Steve
Can anyone summarize what has changed since transformers==4.46.0 in terms of gradient accumulation steps, and whether it affects users upgrading from 4.43 to 4.46.0? Thanks!
