
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'


Hi, I want to use DeepSpeed to speed up my transformer, and I ran into the following problem:

  File "main.py", line 460, in <module>
    main(args)
  File "main.py", line 392, in main
    train_stats = train_one_epoch(
  File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
  File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

My config.json is as follows:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu":1,
  "steps_per_print": 100,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 1e-2
    }
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 100,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 18,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
      "stage": 1,
      "cpu_offload": false,
      "contiguous_gradients": true,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size":1e8,
      "allgather_bucket_size": 5e8

  },
  "activation_checkpointing": {
      "partition_activations": false,
      "contiguous_memory_optimization": false,
      "cpu_checkpointing": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}

TianhaoFu avatar Jul 12 '21 09:07 TianhaoFu

Hi @TianhaoFu, can you share your ds_report output with me? I am curious which DeepSpeed version or commit hash you were on; I am trying to reproduce your issue.

Also, if this issue is quick to reproduce, can you try with "stage": 2 as well?

jeffra avatar Jul 14 '21 16:07 jeffra

config:

{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "train_batch_size": 8,
  "steps_per_print": 4000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "adam_w_mode": true,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 10000,
      "total_num_steps": 100000
    }
  },
  "wall_clock_breakdown": false
}



I get this error:

  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 251, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 146, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 664, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1109, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 726, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Env:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0a0+17f8c32
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.4+6ba9628, 6ba9628, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1

chrjxj avatar Jul 23 '21 16:07 chrjxj

Hi @chrjxj, can you try changing "stage": 1 to "stage": 2 in your config JSON? I want to confirm whether your issue occurs with both ZeRO stages. I am unable to reproduce the error on my side yet.

jeffra avatar Jul 23 '21 17:07 jeffra

Actually @chrjxj, can you set both of these to false in your config? I suspect this will fix your issue.

      "contiguous_gradients": false,
      "overlap_comm": false,

jeffra avatar Jul 23 '21 20:07 jeffra

@jeffra thanks. It still doesn't work and throws a new error message.

chrjxj avatar Jul 25 '21 08:07 chrjxj

@chrjxj, can you provide the stack trace for the new error message?

jeffra avatar Jul 26 '21 15:07 jeffra

Hi @chrjxj, did you find a solution?

ant-louis avatar Jun 01 '22 14:06 ant-louis

@antoiloui, are you also seeing this error? Can you share the deepspeed version you are using and the stack trace? Did you also try turning off contiguous_gradients and overlap_comm?

jeffra avatar Jun 01 '22 17:06 jeffra

Hi @jeffra, yes I'm experiencing the same issue. Here is the error I get:

  File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
    surrogate.backward()
  File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

And here is my config file:

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": false,
        "contiguous_gradients": false
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

ant-louis avatar Jun 01 '22 17:06 ant-louis

Gotcha, I see. Thank you @antoiloui. What version of deepspeed are you running?

Is it possible to provide a repro for this error that you're seeing?

jeffra avatar Jun 01 '22 18:06 jeffra

Hi @chrjxj, did you find a solution?

no... switched to other tasks...

chrjxj avatar Jun 20 '22 02:06 chrjxj

Hasn't this problem been solved? I'm currently facing a similar error. I'm using FusedAdam as the optimizer, so I'm not using the FP16 option, but the error is similar.

Here is the error I get:

Traceback (most recent call last):
  File "/root/QuickDraw/train.py", line 244, in <module>
    train(opt)
  File "/root/QuickDraw/train.py", line 165, in train
    torch.autograd.backward(loss)
  File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' 

This is my deepspeed_config file:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

heojeongyun avatar Mar 08 '23 06:03 heojeongyun


"stage": 2 > "stage":1 Solved

heojeongyun avatar Mar 08 '23 08:03 heojeongyun

Well, let me join this thread too... I have the same issue as described above.

The code I run can be found here: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py

The configuration I use:

{
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": False,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
}

Traceback:

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
    return run_method(*args, **kwargs)
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
    tensor.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

AlexKay28 avatar Apr 20 '23 15:04 AlexKay28


Try changing "stage" from 2 to 1 in the configuration. Do you still see the same problem? I understand that stage 2 improves memory efficiency by also partitioning gradients when training a large model, but in my case this change solved the problem.

The official documentation describes the stages as follows:

Chooses different stages of the ZeRO Optimizer. Stages 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.
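
In config terms, the stage is just the "stage" field of the zero_optimization block, so the change I am suggesting is only (minimal illustrative snippet):

{
    "zero_optimization": {
        "stage": 1
    }
}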

heojeongyun avatar Apr 24 '23 03:04 heojeongyun

I solved my problem by choosing the right combination of Python and package versions. In case anyone is interested:

  • python 3.8
  • torch==2.0.0
  • deepspeed==0.9.1
  • pytorch-lightning==1.9.1

You can see in my traceback that I was running DeepSpeed through the pytorch-lightning interface. I also tried Lightning's predefined configurations like "deepspeed_strategy_2" and "deepspeed_strategy_3" and got the same error every time, so I guess I just had a version compatibility problem.
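
For anyone who wants to reproduce this setup, the same pins expressed as a requirements.txt (assuming a Python 3.8 environment):

torch==2.0.0
deepspeed==0.9.1
pytorch-lightning==1.9.1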

AlexKay28 avatar Apr 24 '23 20:04 AlexKay28


This method can't solve my problem. I am also studying RWKV. Can you help me? My problem is:

Traceback (most recent call last):
  File "/data1/RWKV-LM/RWKV-v4/train.py", line 280, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 628, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 638, in _run_with_setup
    return run_function(*args, **kwargs)
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 359, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/plugins/precision/precision.py", line 73, in backward
    tensor.backward(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Chain-Mao avatar Apr 25 '23 12:04 Chain-Mao

@maomao279 Have you tried v4neo? Also, are you sure you used the same versions during the run, and which CUDA version do you use? (Not sure the last one is important, just curious.)

AlexKay28 avatar Apr 26 '23 12:04 AlexKay28

I got the same issue, but fixed it by removing a redundant backward call. (As far as I can tell, ZeRO's optimizer allocates ipg_buffer and resets ipg_index inside model_engine.backward(), so a bare loss.backward() fires the gradient-reduction hooks before those attributes exist.)

        outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # loss.backward()   # remove this line
        model_engine.backward(loss)
        model_engine.step()

(This code came from ChatGPT, so the mistake is excusable.)

dabney777 avatar Jun 26 '23 13:06 dabney777

Has anybody found a solution other than using different package versions or changing to stage 1? Unfortunately I need stage 2 to work and cannot downgrade the package versions due to dependencies. Help really appreciated.

SophieOstmeier avatar Sep 13 '23 18:09 SophieOstmeier

Any ideas? I got a similar bug: AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

catqaq avatar Sep 17 '23 16:09 catqaq

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().
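
For reference, here is a minimal sketch of that pattern in the style of the getting-started guide; model, training_data, and ds_config.json are placeholders for your own module, dataset, and config file:

import deepspeed

# deepspeed.initialize wraps the model in a DeepSpeedEngine and builds the
# ZeRO optimizer and distributed data loader from the JSON config.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=training_data,
    config="ds_config.json",
)

for batch, labels in train_loader:
    batch = batch.to(model_engine.local_rank)
    labels = labels.to(model_engine.local_rank)
    loss = model_engine(batch, labels)

    # Drive backward/step through the engine, not loss.backward() and
    # optimizer.step(): engine.backward() is what sets up the ZeRO
    # gradient buckets (ipg_buffer / ipg_index) before autograd runs.
    model_engine.backward(loss)
    model_engine.step()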

workingloong avatar Dec 13 '23 11:12 workingloong

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

Thanks for sharing this update. Can you confirm that you were seeing the same error as in the original post?

Also, was your code following this guide for model porting: https://www.deepspeed.ai/getting-started/#writing-deepspeed-models

tjruwase avatar Dec 13 '23 15:12 tjruwase