
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'

upwindflys opened this issue on Mar 15, 2023

Describe the bug
Hello, I'm a novice using DeepSpeed. I trained with the ds_config.json below but got the error 'DeepSpeedZeRoOffload' object has no attribute 'backward'. The config file is as follows; can anyone give some suggestions? Thanks in advance!

{
    "train_batch_size": 4,
    "fp16": {
        "enabled": true,
        "autocast": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false,
            "nvme_path": "/home/tmp"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": false,
            "nvme_path": "/home/tmp",
            "buffer_size": 1e10,
            "max_in_cpu": 1e9
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 5e8,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 5e8,
        "stage3_max_reuse_distance": 5e8,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false
}

To Reproduce
Steps to reproduce the behavior:

  1. See error image


ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/letrain/miniconda/envs/bloom/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/letrain/miniconda/envs/bloom/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 10.2

upwindflys commented on Mar 15, 2023

@upwindflys, are you trying to do training or inference? Can you share how to repro this, including command line and code?

tjruwase commented on Mar 15, 2023

I am having the same problem. This happens while trying to run training with offloading enabled. I am using Accelerate; however, this doesn't seem to be an isolated problem.

WadRex commented on Mar 15, 2023

Are you passing an optimizer to deepspeed.initialize()? Can you share your code or steps to repro?
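
For reference, a minimal sketch of passing an optimizer at deepspeed.initialize() is shown below. The tiny model, tensor sizes, and config path are placeholders (not taken from this issue), and the script is assumed to run under the deepspeed launcher:

import torch
import deepspeed

# Tiny placeholder model; in the issue the real model is a larger transformer.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Passing an optimizer (or defining one in ds_config) gives DeepSpeed a full
# training engine. With ZeRO stage 3 and no optimizer, parameters can end up
# wrapped only for offload, and engine.backward() then fails as reported above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",  # path is an assumption; point this at your config
)

x = torch.randn(2, 16).to(model_engine.device)
loss = model_engine(x).mean()   # placeholder forward pass producing a loss
model_engine.backward(loss)     # backward handled by the DeepSpeed engine
model_engine.step()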

tjruwase commented on Mar 17, 2023

@upwindflys or @WadRex, are you able to resolve this issue by passing optimizer to deepspeed.initialize()?

tjruwase commented on Mar 24, 2023

Closing for lack of response. Please reopen as needed.

tjruwase commented on Mar 30, 2023

Hello, I am also having the same issue. How did you solve it in the end?

lizhidomg commented on May 11, 2023

Hi, I'm getting the same issue too. My ds_config:

{  
    "bf16": {  
        "enabled": "true"  
    },  
    "zero_optimization": {  
        "stage": 3,  
        "offload_optimizer": {  
            "device": "cpu",  
            "pin_memory": true  
        },  
        "offload_param": {  
            "device": "cpu",  
            "pin_memory": true  
        },  
        "overlap_comm": true,  
        "contiguous_gradients": true,  
        "reduce_bucket_size": "auto",  
        "stage3_prefetch_bucket_size": "auto",  
        "stage3_param_persistence_threshold": "auto",  
        "sub_group_size": 1e9,  
        "stage3_max_live_parameters": 1e9,  
        "stage3_max_reuse_distance": 1e9,  
        "stage3_gather_16bit_weights_on_model_save": "auto"  
    },  
    "gradient_accumulation_steps": 8,  
    "gradient_clipping": "auto",  
    "mixed_precision": "fp16",  
    "steps_per_print": 2000,  
    "train_batch_size": "auto",  
    "train_micro_batch_size_per_gpu": "auto",  
    "wall_clock_breakdown": false  
}

My training script:

for step, batch in enumerate(train_loader):  
    with accelerator.accumulate(model):  
        inputs = batch["input_ids"].to(accelerator.device)  
        targets = batch["labels"].to(accelerator.device)  
        model_output = model(input_ids=inputs, labels=targets, return_dict=True)  
        loss = model_output.loss  
        accelerator.backward(loss)

the error:

AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'

I'm trying to do training, and I didn't pass an optimizer to deepspeed.initialize(). How can I solve it?

Muttermal commented on Oct 28, 2023

@Muttermal, you can pass an optimizer through ds_config as follows: https://www.deepspeed.ai/docs/config-json/#optimizer-parameters
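
For illustration, a minimal sketch of such an optimizer block to merge into the existing ds_config (the optimizer type and hyperparameter values here are placeholders to adapt to your setup):

{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    }
}

With an optimizer defined in the config, DeepSpeed constructs it internally, so nothing extra needs to be passed to deepspeed.initialize().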

tjruwase commented on Oct 30, 2023

@tjruwase Thank you for your reply. I passed the optimizer and scheduler through my ds_config. I use Accelerate for training and now get a new error: https://github.com/huggingface/transformers/issues/26148. This seems to be an issue with transformers or accelerate.

Muttermal commented on Oct 31, 2023

@Muttermal, the new issue should not exist with the latest versions of those libraries, unless there is a recent regression. Can you please share your failing stack trace here?

tjruwase commented on Oct 31, 2023