transformers
Error when setting a high batch-size: `AttributeError: 'NoneType' object has no attribute 'backward'`
System Info
Transformers version: latest@github
Accelerate version: latest@github
Deepspeed version: latest@github
Who can help?
@pacman100 @sgugger
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
Use a high per_device_train_batch_size and let Trainer drop the batch size. Launched with torchrun and DeepSpeed ZeRO-2.
[INFO|trainer.py:1786] 2023-06-28 09:03:54,973 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:03:54,973 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:03:54,973 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:03:54,973 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1790] 2023-06-28 09:03:54,973 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:03:54,973 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:03:54,973 >> Total optimization steps = 8
[INFO|trainer.py:1793] 2023-06-28 09:03:54,974 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:00<?, ?it/s][INFO|trainer.py:1786] 2023-06-28 09:04:12,933 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:04:12,933 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:04:12,934 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:04:12,934 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1790] 2023-06-28 09:04:12,934 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:04:12,934 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:04:12,934 >> Total optimization steps = 12
[INFO|trainer.py:1793] 2023-06-28 09:04:12,936 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/app/finetune.py", line 796, in <module>
main()
File "/app/finetune.py", line 732, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/memory.py", line 132, in decorator
return function(batch_size, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2770, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1849, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'
In this case, I set per_device_train_batch_size to 32, which is (knowingly) too large for an A100-80GB. Trainer drops the batch size from 32 to 16 when it overflows, which is expected behavior, but then fails at self.accelerator.backward(loss).
I don't see this issue when I set a batch size that fits the GPU, only when it overflows. I suspect accelerator.prepare needs to be called again with the corrected batch size.
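For anyone skimming: the halving comes from Accelerate's find_executable_batch_size utility, which wraps _inner_training_loop when auto_find_batch_size is enabled (it is the memory.py frame in the traceback above). A simplified sketch of that retry loop, not the exact implementation:

import functools

def find_executable_batch_size_sketch(function, starting_batch_size=128):
    """Simplified sketch of accelerate.utils.find_executable_batch_size:
    retry `function` with a halved batch size whenever it hits CUDA OOM."""
    batch_size = starting_batch_size

    @functools.wraps(function)
    def decorator(*args, **kwargs):
        nonlocal batch_size
        while True:
            if batch_size == 0:
                raise RuntimeError("No executable batch size found, reached zero.")
            try:
                # Re-enters the wrapped training loop with the reduced batch
                # size. In Trainer this is _inner_training_loop being called
                # again -- note that nothing here re-runs accelerator.prepare.
                return function(batch_size, *args, **kwargs)
            except RuntimeError as exc:  # the real code checks specifically for OOM-type errors
                if "out of memory" in str(exc).lower():
                    batch_size //= 2
                else:
                    raise

    return decorator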
Expected behavior
Trainer drops the batch size from 32 to 16 and training continues without failure.
cc @muellerzr (?)
@pacman100 could there be something more I need to check/do related to the DeepSpeed plugin when doing this that we might be missing? (Basically, is there a separate parameter we should set for the train batch size here?)
I can repro this so let me know if you need more logs. I'm trying to debug this myself too.
@orangetin can you tell us more about the DeepSpeed configuration you are using, how you are launching the script, and the args used? It looks like DeepSpeed isn't being properly set in the Accelerator, hence the issue (or something along those lines). I have a feeling that if you don't use DeepSpeed it will work.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@muellerzr
Same here. The problem occurs only when per_device_train_batch_size is too large. But strangely, when I used another tokenizer things went right, and --auto_find_batch_size worked normally.
Here is my command to run run_clm.py (only a part of it) and my deepspeed.config.
deepspeed --include localhost:4,5,6,7 run_clm.py --model_type gpt2 --do_train --do_eval --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --auto_find_batch_size True --gradient_accumulation_steps 16 --learning_rate 0.001 --fp16 False --fp16_full_eval False
{
"fp16": {
"enabled": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.001,
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"train_micro_batch_size_per_gpu": "auto"
}
And my error trace:
File "/xxxx/run_clm.py", line 679, in <module>
main()
File "/xxxx/run_clm.py", line 627, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/xxxx/lib/python3.10/site-packages/accelerate/utils/memory.py", line 136, in decorator
return function(batch_size, *args, **kwargs)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 2693, in training_step
self.accelerator.backward(loss)
File "/xxxxxxxxxxx/lib/python3.10/site-packages/accelerate/accelerator.py", line 1917, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'
Thanks @ekkkkki for the context. I must have missed this. @muellerzr is this enough to go on or would you like more details?
Thanks for the ping, I'll take a look at this today or tomorrow!
Any updates on this bug? I can't run it even with a batch_size of 2 or 8 (tried in SageMaker with ml.g5.12xlarge and ml.g4dn.12xlarge). I am out of ideas; I even tried going back to the commit from 21 Aug (which worked for me) and it doesn't work (for both transformers and accelerate) with deepspeed 0.10.0.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @muellerzr are you still working on this?
I think this ought to be reopened. BTW, it only happens to me when I try to finetune a Llama2 derivative, but not when I finetune Mistral or Zephyr. Disabling auto_find_batch_size is indeed a workaround, but I'd really like to use that awesome feature.
cc @pacman100
The problem seems to be that release_memory clears out the deepspeed_engine_wrapped attribute
https://github.com/huggingface/transformers/blob/35478182ce50d04bde5c4ecd0569c2f6ba15bee7/src/transformers/trainer.py#L1547
whenever we re-enter _inner_training_loop. That would have been fine, but once the model has already been wrapped in the previous try, accelerator.prepare will not be called, leaving accelerator.deepspeed_engine_wrapped as None:
https://github.com/huggingface/transformers/blob/35478182ce50d04bde5c4ecd0569c2f6ba15bee7/src/transformers/trainer.py#L1655-L1660
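To make the failure mode concrete, here is a heavily condensed paraphrase of that flow (my sketch, not the actual trainer.py code; compute_loss_placeholder stands in for the per-step loss computation):

# Paraphrased sketch of the reported failure mode; not the real trainer.py.
class TrainerSketch:
    def _inner_training_loop(self, batch_size=None):
        # On every (re-)entry the Trainer frees memory; inside Accelerate this
        # also resets accelerator.deepspeed_engine_wrapped to None.
        self.accelerator.free_memory()

        # prepare() only runs when the model has not been wrapped yet. After
        # the first OOM attempt, self.model_wrapped already holds the wrapped
        # model, so this branch is skipped and the DeepSpeed engine wrapper is
        # never re-created.
        if self.model_wrapped is self.model:
            self.model_wrapped = self.accelerator.prepare(self.model)

        # Every training step then calls accelerator.backward(loss), which
        # forwards to deepspeed_engine_wrapped.backward(loss) and now raises
        # AttributeError: 'NoneType' object has no attribute 'backward'.
        loss = self.compute_loss_placeholder()
        self.accelerator.backward(loss)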
Any hacks to get around this?
Well, for now I am resolving it like this:
from transformers import Trainer

class HFTrainer(Trainer):
    def _inner_training_loop(self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None):
        # Hack to fix: https://github.com/huggingface/transformers/issues/24558
        # Reset the wrapped model and deepspeed engine so accelerator.prepare
        # runs again when the auto batch size finder re-enters this loop.
        if self.args.auto_find_batch_size:
            self.model_wrapped = self.model
            self.deepspeed = None
        return super()._inner_training_loop(batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
I am not entirely sure if this is correct, or whether it would even be the right approach for ZeRO-3, but it at least makes ZeRO-1 and ZeRO-2 work with the auto batch size finder.
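For anyone wanting to try it, the subclass is a drop-in replacement for Trainer; model, training_args, and the datasets below are placeholders for objects your script already builds:

# Drop-in usage of the HFTrainer workaround above; model, training_args and
# the datasets are placeholders for objects your script already constructs.
trainer = HFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()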
So the workaround I was using worked fine for Llama 7B, but with Mistral 7B it behaves weirdly: it seems to release memory on only one rank but not the other (using only 2 GPUs at the moment), and Trainer gets stuck completely. Memory usage on the two ranks: 23291MiB vs 9439MiB.
It seems like one rank went OOM and decided to lower the batch size, but the other rank didn't. I'll debug some more, but @pacman100 @sgugger I could use some help from you for a proper fix 😅
My setup looks like this:
torch==2.1.1+cu118
transformers[accelerate,deepspeed,sentencepiece,tokenizers]==4.36.1
datasets==2.14.7
peft==0.6.2
bitsandbytes==0.41.3.post2
I am trying 4 bit qlora on 7B models with 2 GPUs
EDIT: After reading some of the code, I wonder how the ranks sync and agree on the same per-device batch size when I enable the auto batch size finder.
EDIT: The batch size doesn't seem to get correctly set in the deepspeed plugin once it is re-adjusted, so I am not sure whether the optimizers and schedulers get initialized correctly 🤔
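One thing I have been experimenting with, building on the subclass I posted above (untested beyond a quick run, and the deepspeed_plugin/deepspeed_config attribute path is my reading of Accelerate's internals rather than a documented API), is forcing the plugin's micro batch size to follow the re-adjusted value before the loop re-enters:

# Untested sketch: keep the DeepSpeed plugin's micro batch size in sync with
# whatever batch size the auto finder passes in. Assumes Accelerate exposes
# the plugin at accelerator.state.deepspeed_plugin with a deepspeed_config dict.
class HFTrainerWithSync(HFTrainer):
    def _inner_training_loop(self, batch_size=None, args=None,
                             resume_from_checkpoint=None, trial=None,
                             ignore_keys_for_eval=None):
        plugin = getattr(self.accelerator.state, "deepspeed_plugin", None)
        if plugin is not None and batch_size is not None:
            plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = batch_size
        return super()._inner_training_loop(
            batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval
        )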
Hello, there are a lot of things being discussed in this single issue.
I am trying 4 bit qlora on 7B models with 2 GPUs
I don't think qlora is supported with DeepSpeed.
The problem seems to be that release_memory clears out the deepspeed_engine_wrapped attribute (self.accelerator.free_memory(), src/transformers/trainer.py line 1547 at 3547818) whenever we re-enter _inner_training_loop, which would have been fine, but once the model has already been wrapped in the previous try, accelerator.prepare will not be called, leaving accelerator.deepspeed_engine_wrapped None.
Thanks for providing more details. This is a niche issue, and we will prioritize it based on the available bandwidth.
Hello, there are a lot of things being discussed in this single issue.
Agreed, sorry for that 😅
I don't think qlora is supported with DeepSpeed.
Interesting, I would like a separate discussion for this. It seems to work fine with ZeRO-2 and a static batch size; I even compared the loss curves with DDP and they are almost the same. It also makes sense theoretically, since only the optimizer states and gradients are sharded, which in QLoRA are just the trainable adapters in bfloat16/float16/float32. I have also seen the community using axolotl use it successfully. ZeRO-3 indeed does not work. Anyway, not the topic for this issue.
The only reason I brought that up here is that DeepSpeed ZeRO sharding can cause uneven memory consumption across GPUs, and the ranks can then disagree on batch sizes and everything gets stuck.
I don't think qlora is supported with DeepSpeed.
I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great—until I enable auto_find_batch_size.
Thanks for providing more details. This is a niche issue, and we will prioritize it based on the available bandwidth.
This is a niche issue? I feel like most people would rather make use of auto_find_batch_size and avoid OOM errors with ease. BTW, I was wrong: this problem does occur when finetuning Llama2 models.
I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great—until I enable auto_find_batch_size.
Nice, I meant DeepSpeed ZeRO-3 + QLoRA; I should have been clearer about that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I definitely don't want to see this issue marked as stale.
@mhillebrand it'll be closed after we merge #28088 which adds the support in for auto batch size finder :)
@mhillebrand it'll be closed after we merge #28088 which adds the support in for auto batch size finder :)
Ah, I didn't see the linked PR from a month ago. Thank you!