Question about an error when using mixed-precision-training on V100
System Info
SOFTWARE:
- transformers==4.39.3
- peft==0.9.0
- accelerate==0.27.2
- torch==1.13.1
HARDWARE:
- NVIDIA V100
Who can help?
@ArthurZucker @muellerzr @SunMarc
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Excuse me, there was an error when I used the `Trainer` to train llama2-7b on alpaca with mixed-precision training. I loaded the model in float16 by setting `torch_dtype=torch.float16` in `.from_pretrained()`, and I set `fp16=True` in `TrainingArguments`, but an error occurred: `ValueError: Attempting to scale FP16 gradients.`
Then I tried to load the model in float32, still keeping `fp16=True`, and it worked! But I found that the GPU memory usage is the same as with `fp16=False`. So I think mixed-precision training was not used, right?
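A minimal sketch of the setup described above (the model id is a placeholder and the dataset-loading step is omitted, so this is illustrative rather than my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the model weights directly in fp16 -- this is the step that
# later triggers "ValueError: Attempting to scale FP16 gradients."
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    torch_dtype=torch.float16,
)

args = TrainingArguments(
    output_dir="./out",
    fp16=True,  # request fp16 mixed precision (enables the gradient scaler)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: the tokenized alpaca dataset
)
trainer.train()  # raises the ValueError during training
```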
Expected behavior
I want to know which style is mixed-precision training: 1) loading the model in float32 and setting `fp16`/`bf16=True` in `TrainingArguments`, or 2) loading the model in float16/bfloat16 and setting `fp16`/`bf16=True`?
And I want to know how to solve the error `ValueError: Attempting to scale FP16 gradients.`
Thank you!
Hi @AIR-hl, thanks for raising an issue!
Could you share a minimal code snippet which reproduces the error, specifically showing how the trainer arguments are configured and how the trainer is called?
cc @younesbelkada too
Hi @AIR-hl !
You cannot perform pure fp16 training, as it is not supported by PyTorch. To do mixed-precision fp16 training, you should load the model in full precision and pass `fp16=True`. Alternatively, you can pass a bf16 model to the trainer for pure bf16 training, which, by the way, is not supported on V100.
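For reference, a minimal sketch of the working configuration (model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Keep the master weights in fp32; the Trainer's fp16 autocast and
# gradient scaler then handle the half-precision compute.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    torch_dtype=torch.float32,   # or simply omit torch_dtype
)

args = TrainingArguments(
    output_dir="./out",
    fp16=True,  # mixed-precision fp16 training
)
```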
Ok, thank you very much! I think all my questions have been answered.
By the way, it's interesting that when I load the model in `float16` but both `fp16` and `bf16` are `False`, the trainer can still run, although the result is wrong (probably overflowed). Of course, it doesn't matter.
Thank you again for your contributions!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing this since the issue is solved!