transformers
Error when setting a high batch-size: `AttributeError: 'NoneType' object has no attribute 'backward'`
System Info
Transformers version: latest@github
Accelerate version: latest@github
Deepspeed version: latest@github
Who can help?
@pacman100 @sgugger
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
Use a high per_device_train_batch_size and let Trainer drop the batch size. Launched with torchrun and DeepSpeed ZeRO-2.
[INFO|trainer.py:1786] 2023-06-28 09:03:54,973 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:03:54,973 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:03:54,973 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:03:54,973 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1790] 2023-06-28 09:03:54,973 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:03:54,973 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:03:54,973 >> Total optimization steps = 8
[INFO|trainer.py:1793] 2023-06-28 09:03:54,974 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:00<?, ?it/s][INFO|trainer.py:1786] 2023-06-28 09:04:12,933 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-06-28 09:04:12,933 >> Num examples = 338
[INFO|trainer.py:1788] 2023-06-28 09:04:12,934 >> Num Epochs = 4
[INFO|trainer.py:1789] 2023-06-28 09:04:12,934 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1790] 2023-06-28 09:04:12,934 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:1791] 2023-06-28 09:04:12,934 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] 2023-06-28 09:04:12,934 >> Total optimization steps = 12
[INFO|trainer.py:1793] 2023-06-28 09:04:12,936 >> Number of trainable parameters = 8,388,608
0%| | 0/8 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/app/finetune.py", line 796, in <module>
main()
File "/app/finetune.py", line 732, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/memory.py", line 132, in decorator
return function(batch_size, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2770, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1849, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'
In this case, I set per_device_train_batch_size to 32, which is (knowingly) too large for an A100-80GB. Trainer drops the batch size from 32 to 16 when it overflows, which is expected behavior, but then fails at self.accelerator.backward(loss).
I don't see this issue when I set a batch size that fits the GPU, only when it overflows. I suspect accelerator.prepare needs to be called again with the corrected batch size.
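For anyone skimming: the halving comes from Accelerate's find_executable_batch_size utility, which wraps _inner_training_loop when auto_find_batch_size is enabled (it is the memory.py frame in the traceback above). A simplified sketch of that retry loop, not the exact implementation:

import functools

def find_executable_batch_size_sketch(function, starting_batch_size=128):
    """Simplified sketch of accelerate.utils.find_executable_batch_size:
    retry `function` with a halved batch size whenever it hits CUDA OOM."""
    batch_size = starting_batch_size

    @functools.wraps(function)
    def decorator(*args, **kwargs):
        nonlocal batch_size
        while True:
            if batch_size == 0:
                raise RuntimeError("No executable batch size found, reached zero.")
            try:
                # Re-enters the wrapped training loop with the reduced batch
                # size. In Trainer this is _inner_training_loop being called
                # again -- note that nothing here re-runs accelerator.prepare.
                return function(batch_size, *args, **kwargs)
            except RuntimeError as exc:  # the real code checks specifically for OOM-type errors
                if "out of memory" in str(exc).lower():
                    batch_size //= 2
                else:
                    raise

    return decorator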
Expected behavior
Trainer drops the batch size from 32 to 16 and training continues without failure.
cc @muellerzr (?)
@pacman100 could there be something more I need to check/do related to the DeepSpeed plugin when doing this that we might be missing? (Basically, is there a separate parameter we should set for the train batch size here?)
I can repro this so let me know if you need more logs. I'm trying to debug this myself too.
@orangetin can you tell us more about the DeepSpeed configuration you are using, how you are launching the script, and the args used? It looks like DeepSpeed isn't being properly set in the Accelerator, hence the issue (or something along those lines). I have a feeling that if you don't use DeepSpeed it will work.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@muellerzr
Same here. The problem occurs only when per_device_train_batch_size is too large. But strangely, when I used another tokenizer things went right, and --auto_find_batch_size worked normally.
Here is my command to run run_clm.py (only a part of it) and my deepspeed.config.
deepspeed --include localhost:4,5,6,7 run_clm.py --model_type gpt2 --do_train --do_eval --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --auto_find_batch_size True --gradient_accumulation_steps 16 --learning_rate 0.001 --fp16 False --fp16_full_eval False
{
"fp16": {
"enabled": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.001,
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"train_micro_batch_size_per_gpu": "auto"
}
And my error trace:
File "/xxxx/run_clm.py", line 679, in <module>
main()
File "/xxxx/run_clm.py", line 627, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/xxxx/lib/python3.10/site-packages/accelerate/utils/memory.py", line 136, in decorator
return function(batch_size, *args, **kwargs)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/xxxx/lib/python3.10/site-packages/transformers/trainer.py", line 2693, in training_step
self.accelerator.backward(loss)
File "/xxxxxxxxxxx/lib/python3.10/site-packages/accelerate/accelerator.py", line 1917, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
AttributeError: 'NoneType' object has no attribute 'backward'
Thanks @ekkkkki for the context. I must have missed this. @muellerzr is this enough to go on or would you like more details?
Thanks for the ping, I'll take a look at this today or tomorrow!
Any updates on this bug? I can't run it even with a batch_size of 2 or 8 (tried in SageMaker with ml.g5.12xlarge and ml.g4dn.12xlarge). I am out of ideas; I even tried going back to the commit from 21 Aug (which worked for me) and it doesn't work (for both transformers and accelerate) with deepspeed 0.10.0.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @muellerzr are you still working on this?
I think this ought to be reopened. BTW, it only happens to me when I try to finetune a Llama2 derivative, but not when I finetune Mistral or Zephyr. Disabling auto_find_batch_size is indeed a workaround, but I'd really like to use that awesome feature.
cc @pacman100
The problem seems to be that release_memory clears out the deepspeed_engine_wrapped attribute
https://github.com/huggingface/transformers/blob/35478182ce50d04bde5c4ecd0569c2f6ba15bee7/src/transformers/trainer.py#L1547
whenever we re-enter _inner_training_loop. That would have been fine, but once the model has already been wrapped in the previous try, accelerator.prepare will not be called, leaving accelerator.deepspeed_engine_wrapped as None:
https://github.com/huggingface/transformers/blob/35478182ce50d04bde5c4ecd0569c2f6ba15bee7/src/transformers/trainer.py#L1655-L1660
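To make the failure mode concrete, here is a heavily condensed paraphrase of that flow (my sketch, not the actual trainer.py code; compute_loss_placeholder stands in for the per-step loss computation):

# Paraphrased sketch of the reported failure mode; not the real trainer.py.
class TrainerSketch:
    def _inner_training_loop(self, batch_size=None):
        # On every (re-)entry the Trainer frees memory; inside Accelerate this
        # also resets accelerator.deepspeed_engine_wrapped to None.
        self.accelerator.free_memory()

        # prepare() only runs when the model has not been wrapped yet. After
        # the first OOM attempt, self.model_wrapped already holds the wrapped
        # model, so this branch is skipped and the DeepSpeed engine wrapper is
        # never re-created.
        if self.model_wrapped is self.model:
            self.model_wrapped = self.accelerator.prepare(self.model)

        # Every training step then calls accelerator.backward(loss), which
        # forwards to deepspeed_engine_wrapped.backward(loss) and now raises
        # AttributeError: 'NoneType' object has no attribute 'backward'.
        loss = self.compute_loss_placeholder()
        self.accelerator.backward(loss)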
Any hacks to get around this?
Well, for now I am resolving it like this:
from transformers import Trainer

class HFTrainer(Trainer):
    def _inner_training_loop(self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None):
        # Hack to fix: https://github.com/huggingface/transformers/issues/24558
        # Reset the wrapped model and deepspeed engine so accelerator.prepare
        # runs again when the auto batch size finder re-enters this loop.
        if self.args.auto_find_batch_size:
            self.model_wrapped = self.model
            self.deepspeed = None
        return super()._inner_training_loop(batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
I am not entirely sure if this is correct, or whether it would even be the right approach for ZeRO-3, but it at least makes ZeRO-1 and ZeRO-2 work with the auto batch size finder.
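For anyone wanting to try it, the subclass is a drop-in replacement for Trainer; model, training_args, and the datasets below are placeholders for objects your script already builds:

# Drop-in usage of the HFTrainer workaround above; model, training_args and
# the datasets are placeholders for objects your script already constructs.
trainer = HFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()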
So the workaround I was using worked fine for Llama 7B, but with Mistral 7B it behaves weirdly: it seems to release memory on only one rank but not the other (using only 2 GPUs at the moment), and Trainer gets stuck completely. Memory usage on the two ranks: 23291MiB vs 9439MiB.
It seems like one rank went OOM and decided to lower the batch size, but the other rank didn't. I'll debug some more, but @pacman100 @sgugger I could use some help from you for a proper fix 😅
My setup looks like this:
torch==2.1.1+cu118
transformers[accelerate,deepspeed,sentencepiece,tokenizers]==4.36.1
datasets==2.14.7
peft==0.6.2
bitsandbytes==0.41.3.post2
I am trying 4 bit qlora on 7B models with 2 GPUs
EDIT: After reading some of the code, I wonder how the ranks sync and agree on the same per-device batch size when I enable the auto batch size finder.
EDIT: The batch size doesn't seem to get correctly set in the deepspeed plugin once it is re-adjusted, so I am not sure whether the optimizers and schedulers get initialized correctly 🤔
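One thing I have been experimenting with, building on the subclass I posted above (untested beyond a quick run, and the deepspeed_plugin/deepspeed_config attribute path is my reading of Accelerate's internals rather than a documented API), is forcing the plugin's micro batch size to follow the re-adjusted value before the loop re-enters:

# Untested sketch: keep the DeepSpeed plugin's micro batch size in sync with
# whatever batch size the auto finder passes in. Assumes Accelerate exposes
# the plugin at accelerator.state.deepspeed_plugin with a deepspeed_config dict.
class HFTrainerWithSync(HFTrainer):
    def _inner_training_loop(self, batch_size=None, args=None,
                             resume_from_checkpoint=None, trial=None,
                             ignore_keys_for_eval=None):
        plugin = getattr(self.accelerator.state, "deepspeed_plugin", None)
        if plugin is not None and batch_size is not None:
            plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = batch_size
        return super()._inner_training_loop(
            batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval
        )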
Hello, there are a lot of things being discussed in this single issue.
I am trying 4 bit qlora on 7B models with 2 GPUs
I don't think qlora is supported with DeepSpeed.
The problem seems to be that release_memory clears out the deepspeed_engine_wrapped attribute (self.accelerator.free_memory(), src/transformers/trainer.py line 1547 at 3547818) whenever we re-enter _inner_training_loop, which would have been fine, but once the model has already been wrapped in the previous try, accelerator.prepare will not be called, leaving accelerator.deepspeed_engine_wrapped None.
Thanks for providing more details. This is a niche issue, and we will prioritize it based on the available bandwidth.
Hello, there are a lot of things being discussed in this single issue.
Agreed, sorry for that 😅
I don't think qlora is supported with DeepSpeed.
Interesting, I would like a separate discussion for this. It seems to work fine with ZeRO-2 and a static batch size; I even compared the loss curves with DDP and they are almost the same. It also makes sense theoretically, since only the optimizer states and gradients are sharded, which in QLoRA are just the trainable adapters in bfloat16/float16/float32. I have also seen the community using axolotl use it successfully. ZeRO-3 indeed does not work. Anyway, not the topic for this issue.
The only reason I brought that up here is that DeepSpeed ZeRO sharding can cause uneven memory consumption across GPUs, and the ranks can then disagree on batch sizes and everything gets stuck.
I don't think qlora is supported with DeepSpeed.
I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great—until I enable auto_find_batch_size.
Thanks for providing more details. This is a niche issue, and we will prioritize it based on the available bandwidth.
This is a niche issue? I feel like most people would rather make use of auto_find_batch_size and avoid OOM errors with ease. BTW, I was wrong: this problem does occur when finetuning Llama2 models.
I use DeepSpeed (ZeRO-2) with both LoRA and QLoRA, and it works great—until I enable auto_find_batch_size.
Nice, I meant DeepSpeed ZeRO-3 + QLoRA; I should have been clearer about that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I definitely don't want to see this issue marked as stale.
@mhillebrand it'll be closed after we merge #28088 which adds the support in for auto batch size finder :)
@mhillebrand it'll be closed after we merge #28088 which adds the support in for auto batch size finder :)
Ah, I didn't see the linked PR from a month ago. Thank you!