
Question about an error when using mixed-precision-training on V100

Open AIR-hl opened this issue 10 months ago • 4 comments

System Info

SOFTWARE:

  • transformers==4.39.3
  • peft==0.9.0
  • accelerate==0.27.2
  • torch==1.13.1

HARDWARE:

  • NVIDIA V100

Who can help?

@ArthurZucker @muellerzr @SunMarc

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Excuse me, I ran into an error when using the Trainer to train llama2-7b on Alpaca with mixed-precision training.

I loaded the model in float16 by setting torch_dtype=torch.float16 in .from_pretrained(), and I set fp16=True in TrainingArguments, but an error occurred: 'ValueError: Attempting to scale FP16 gradients.'

Then I tried loading the model in float32, still keeping fp16=True, and it worked! But I found that GPU memory usage is the same as with fp16=False, so I think mixed-precision training was not actually used, right?
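Since the full script was not included, here is a minimal sketch of the setup described above; the checkpoint name, output directory, and one-example dummy dataset are placeholders, not the actual training code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Model weights are loaded directly in fp16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder checkpoint name
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Tiny dummy dataset so the snippet is self-contained; the real run used Alpaca.
sample = tokenizer("Hello world")
train_dataset = [{"input_ids": sample["input_ids"], "labels": sample["input_ids"]}]

args = TrainingArguments(
    output_dir="./repro",
    per_device_train_batch_size=1,
    fp16=True,                           # mixed-precision flag on top of fp16 weights
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # fails with the "Attempting to scale FP16 gradients" ValueError quoted above
```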

Expected behavior

I want to know which setup counts as mixed-precision training: 1) the model loaded in float32 with fp16/bf16=True in TrainingArguments, or 2) the model loaded in float16/bfloat16 with fp16/bf16=True in TrainingArguments?

And I want to know how to solve the error: 'ValueError: Attempting to scale FP16 gradients.'

Thank you!

AIR-hl avatar Apr 07 '24 15:04 AIR-hl

Hi @AIR-hl, thanks for raising an issue!

Could you share a minimal code snippet which reproduces the error, specifically showing how the trainer arguments are configured and the trainer called?

cc @younesbelkada too

amyeroberts avatar Apr 08 '24 13:04 amyeroberts

Hi @AIR-hl ! You cannot perform pure fp16 training, as it is not supported by PyTorch. To do mixed-precision fp16 training, you should either load the model in full precision and pass fp16=True, or pass a bf16 model to the Trainer for pure bf16 training, which is, by the way, not supported on V100.
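To make the distinction concrete, here is a minimal sketch of the configuration described above (the checkpoint name and dummy dataset are placeholders, not code from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Master weights stay in full precision (float32).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder checkpoint name
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Tiny dummy dataset so the snippet is self-contained.
sample = tokenizer("Hello world")
train_dataset = [{"input_ids": sample["input_ids"], "labels": sample["input_ids"]}]

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    fp16=True,   # forward/backward run under fp16 autocast; the optimizer updates fp32 weights
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```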

younesbelkada avatar Apr 08 '24 14:04 younesbelkada

> Hi @AIR-hl ! You cannot perform pure fp16 training, as it is not supported by PyTorch. To do mixed-precision fp16 training, you should either load the model in full precision and pass fp16=True, or pass a bf16 model to the Trainer for pure bf16 training, which is, by the way, not supported on V100.

Ok, thank you very much! I think all my questions have been answered.

By the way, it's interesting that when I load the model in float16 but both fp16 and bf16 are False, the Trainer can still run, although the results are wrong (probably due to overflow). Of course, it doesn't matter.

Thank you again for your contributions!

AIR-hl avatar Apr 08 '24 16:04 AIR-hl

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 08 '24 08:05 github-actions[bot]

Closing this since the issue is solved!

SunMarc avatar May 13 '24 09:05 SunMarc