
New GA fix causes training loss multiple times higher across the board (5x to 10x higher)

Open · JianbangZ opened this issue

System Info

8xH100

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both have the same problem.

SFT after GA fix: [image]

SFT before GA fix: [image]

Expected behavior

The training loss values should be aligned with the old values, or be lower than before.

JianbangZ avatar Oct 19 '24 11:10 JianbangZ

I also ran into this: the loss still decreases while training, but the loss values and grad_norm became very large. According to the blog (https://huggingface.co/blog/gradient_accumulation), the code seems to have changed to 'sum' the loss instead of using 'mean', and then divide it by 'num_items'. I'm wondering if this 'sum' value is what is being logged before the division. If I'm mistaken, please let me know.

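For illustration, a rough sketch of the two reductions described in the blog post, with hypothetical toy tensors (not the actual Trainer code):

    import torch
    import torch.nn.functional as F

    vocab = 32
    # two micro-batches from one gradient-accumulation step (hypothetical logits/labels)
    micro_batches = [
        (torch.randn(4, 10, vocab), torch.randint(0, vocab, (4, 10))) for _ in range(2)
    ]

    # old behaviour: mean loss per micro-batch, then averaged across accumulation steps
    mean_losses = [F.cross_entropy(l.view(-1, vocab), y.view(-1)) for l, y in micro_batches]
    old_loss = sum(mean_losses) / len(mean_losses)

    # new behaviour: sum the token losses, then divide once by num_items_in_batch
    # (the token count of the whole accumulated batch)
    sum_losses = [
        F.cross_entropy(l.view(-1, vocab), y.view(-1), reduction="sum") for l, y in micro_batches
    ]
    num_items_in_batch = sum(y.numel() for _, y in micro_batches)
    new_loss = sum(sum_losses) / num_items_in_batch

    # if the division by num_items_in_batch is skipped, what gets logged is the raw sum,
    # which is far larger than the old per-token mean

When the division is applied, the two values agree whenever every micro-batch contributes the same number of tokens; they only diverge when padding varies across micro-batches, which is what the GA fix addresses.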

KeonwooChoi avatar Oct 19 '24 13:10 KeonwooChoi

Can you share more info? It would be useful to know at least the training arguments and model used.

benjamin-marie avatar Oct 21 '24 09:10 benjamin-marie

Actually, would it be possible for you both, @JianbangZ and @KeonwooChoi, to try it via Unsloth? Use the same dataset if possible - thanks!

danielhanchen avatar Oct 21 '24 09:10 danielhanchen

Here is my experiment. I can confirm the fix itself still needs to be fixed.

paulcx avatar Oct 21 '24 12:10 paulcx

Will take a look since https://github.com/huggingface/transformers/pull/34198 is really what fixed it on the trainer

muellerzr avatar Oct 21 '24 12:10 muellerzr

Just an FYI: for a true test, it would be better to compare this against a run with no gradient accumulation. I wouldn't be surprised if the LR hyperparameters need to be different now, but I'm looking at this today/right now.

muellerzr avatar Oct 21 '24 13:10 muellerzr

@muellerzr Does transformers need to update all the loss functions in the corresponding modeling files to take "num_items_in_batch" as an argument? Such as https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1212C94-L1212C118. A fix is mentioned here (though in Chinese): https://github.com/hiyouga/LLaMA-Factory/issues/5747#issuecomment-2425488226

JianbangZ avatar Oct 21 '24 13:10 JianbangZ

Yeah. Will do so today

muellerzr avatar Oct 21 '24 13:10 muellerzr

@JianbangZ you can build off of pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues. It has the rest of the models + the full fix for trainer

muellerzr avatar Oct 21 '24 13:10 muellerzr

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

Do you have to give num_items_in_batch as a Trainer argument now?

JianbangZ avatar Oct 21 '24 14:10 JianbangZ

No, a user never has to keep track of that

muellerzr avatar Oct 21 '24 14:10 muellerzr

No, a user never has to keep track of that

Tried it; the loss value is still very big. It seems num_items_in_batch is not taking effect.

JianbangZ avatar Oct 21 '24 14:10 JianbangZ

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline, if possible.

muellerzr avatar Oct 21 '24 14:10 muellerzr

Also, though: they shouldn't be the same; they will be different from the old version because they are two entirely different calculations. Hence the request for a reproducer and a test with no grad accum.
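For example, an equivalent no-accumulation baseline could be set up like this (hypothetical batch sizes and output dirs):

    from transformers import TrainingArguments

    # run under test: gradient accumulation, effective batch = 8 * 4 = 32 per device
    args_ga = TrainingArguments(
        output_dir="out-ga",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
    )

    # baseline: same effective batch size, no accumulation
    args_baseline = TrainingArguments(
        output_dir="out-no-ga",
        per_device_train_batch_size=32,
        gradient_accumulation_steps=1,
    )

With the fix in place, the two runs should track each other closely, whereas comparing the new GA run only against pre-fix GA runs mixes two different loss calculations.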

muellerzr avatar Oct 21 '24 14:10 muellerzr

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline, if possible.

I found out why: your self.model_accepts_loss_kwargs is not set to True (how is it supposed to be set dynamically?). Anyway, I hardcoded it to True and the loss value seems to be back to normal. I will run a full 3-hour training (a multimodal training with Llama3.2-3B as the backend) and then report the loss curves here.

JianbangZ avatar Oct 21 '24 15:10 JianbangZ

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument
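(Roughly, that check amounts to inspecting the forward signature; a minimal sketch using the Llama 3.2 3B Instruct checkpoint as an example, not the exact Trainer internals:)

    import inspect

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

    # if forward() accepts the loss_kwargs catch-all, the Trainer can pass
    # num_items_in_batch through to the model's loss computation;
    # otherwise it falls back to the old behaviour
    sig = inspect.signature(model.forward)
    print("loss_kwargs" in sig.parameters)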

muellerzr avatar Oct 21 '24 15:10 muellerzr

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument

Yes, I have a script that uses Llama-3.2-3B-it as the LLM backend and does LLaVA-style training. I think it's good to test MLLM training beyond just regular LLM SFT training to see how things go. And where do I set, or how is, the forward() signature determined? Shouldn't loss_kwargs always be accepted?

JianbangZ avatar Oct 21 '24 15:10 JianbangZ

To my knowledge this is really only for CLM/LLM training, so the PR limits it to that (e.g. it wouldn't make sense for classification, or for problems where the lengths of inputs matter w.r.t. the loss function). You can see in the details of the PR where/how the models are updated with this new mechanic.

What does model.loss_function give you?

muellerzr avatar Oct 21 '24 15:10 muellerzr

Can confirm it seems fixed after I enabled loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).

JianbangZ avatar Oct 21 '24 21:10 JianbangZ

Can confirm it seems fixed after I enabled loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).

Would you mind summarizing the solution for this issue?

paulcx avatar Oct 22 '24 00:10 paulcx

@muellerzr Thank you for the awesome work in your PR. This is a noob question, but:

If this weren't OOTB in the Trainer API, would we be expected to provide the compute_loss_func argument? Could you provide a small example of what that would look like? Just the source code for your experiments would also work!

man-shar avatar Oct 22 '24 01:10 man-shar

@man-shar the docs in the Trainer are fairly detailed, but there's an example test here: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L740-L834

If I had to guess, the reason I was confused earlier is that it was a PEFT model, which is why it didn't work.

muellerzr avatar Oct 22 '24 12:10 muellerzr

Ah awesome, thanks!

man-shar avatar Oct 22 '24 12:10 man-shar

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues — then enable loss_kwargs in forward(), or you can hardcode model_accepts_loss_kwargs to True in the Trainer. A minimal sketch of the forward() change is shown below.
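Something along these lines for a custom model (a hypothetical MyCausalLM wrapper with only the loss plumbing shown; a sketch, not the upstream modeling code):

    import torch.nn as nn
    import torch.nn.functional as F

    class MyCausalLM(nn.Module):
        # hypothetical wrapper: backbone maps input_ids -> hidden states
        def __init__(self, backbone, hidden_size, vocab_size):
            super().__init__()
            self.backbone = backbone
            self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
            self.vocab_size = vocab_size

        # the Trainer only forwards these kwargs when forward() accepts **loss_kwargs
        def forward(self, input_ids, labels=None, num_items_in_batch=None, **loss_kwargs):
            hidden_states = self.backbone(input_ids)
            logits = self.lm_head(hidden_states)
            loss = None
            if labels is not None:
                # shift so each position predicts the next token
                shift_logits = logits[:, :-1, :].reshape(-1, self.vocab_size)
                shift_labels = labels[:, 1:].reshape(-1)
                if num_items_in_batch is not None:
                    # GA-aware path: sum the token losses, then divide by the token
                    # count of the whole accumulated batch (supplied by the Trainer)
                    loss = F.cross_entropy(
                        shift_logits, shift_labels, ignore_index=-100, reduction="sum"
                    ) / num_items_in_batch
                else:
                    loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
            return {"loss": loss, "logits": logits}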

JianbangZ avatar Oct 22 '24 13:10 JianbangZ

Hey @JianbangZ, can you please give an example? I'm not able to find a parameter called "loss_kwargs" in the Trainer class.

thesillystudent avatar Oct 22 '24 16:10 thesillystudent

@thesillystudent if you build off my branch again, we now take PEFT models into account, which should fix the issue. You, as a user, do not have access to loss_kwargs in the Trainer; it's internal.

You can, however, provide your own compute_loss_func.

(This will be merged shortly)

muellerzr avatar Oct 22 '24 16:10 muellerzr

compute_loss_func

Would you give an example of how to use compute_loss_func?

paulcx avatar Oct 22 '24 18:10 paulcx

@paulcx as mentioned there's an example in the test:

    from functools import partial

    from transformers import AutoModelForCausalLM
    from transformers.loss.loss_utils import ForCausalLMLoss  # helper the test builds on

    # model_name: any causal LM checkpoint (defined earlier in the test)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def compute_loss(logits, labels, vocab_size, num_items_in_batch, disable_num_items_in_batch=False):
        # the first argument receives the model outputs, hence logits["logits"]
        return ForCausalLMLoss(
            logits["logits"], labels, vocab_size, num_items_in_batch, disable_num_items_in_batch
        )

    loss_fn = partial(compute_loss, vocab_size=model.config.vocab_size, disable_num_items_in_batch=False)
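
Presumably loss_fn is then passed to the Trainer via its compute_loss_func argument, roughly like this (hypothetical training args and dataset):

    from transformers import Trainer, TrainingArguments

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                               gradient_accumulation_steps=4),
        train_dataset=train_dataset,  # hypothetical tokenized dataset
        compute_loss_func=loss_fn,
    )
    trainer.train()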

muellerzr avatar Oct 23 '24 16:10 muellerzr

I have re-installed: pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8xH100 NVL (version check forcibly ignored), with enable_liger_kernel: true.

The loss is still way too high, about 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks, Steve

thusinh1969 avatar Oct 25 '24 21:10 thusinh1969

Can anyone summarize what has changed since transformers==4.46.0 in terms of gradient accumulation steps, and whether it affects users upgrading from 4.43 to 4.46.0? Thanks!

haorannlp avatar Dec 02 '24 15:12 haorannlp