DeepSpeed
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
Hi, I want to use DeepSpeed to speed up my transformer, and I ran into the following problem:
File "main.py", line 460, in <module>
main(args)
File "main.py", line 392, in main
train_stats = train_one_epoch(
File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
self._scaler.scale(loss).backward(create_graph=create_graph)
File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
My config.json is as follows:
{
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu":1,
"steps_per_print": 100,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00001,
"weight_decay": 1e-2
}
},
"flops_profiler": {
"enabled": false,
"profile_step": 100,
"module_depth": -1,
"top_modules": 3,
"detailed": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 18,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 1,
"cpu_offload": false,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size":1e8,
"allgather_bucket_size": 5e8
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true
}
Hi @TianhaoFu, can you share your ds_report output with me? I am curious which DeepSpeed version or commit hash you were on. I am trying to reproduce your issue.
Also, if this issue is quick to reproduce, can you also try with "stage": 2?
config:
{
"zero_optimization": {
"stage": 1,
"overlap_comm": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"train_batch_size": 8,
"steps_per_print": 4000,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"adam_w_mode": true,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 10000,
"total_num_steps": 100000
}
},
"wall_clock_breakdown": false
}
I get this error:
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 251, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 146, in backward
Variable._execution_engine.run_backward(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 664, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1109, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 726, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
Env:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0a0+17f8c32
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.4+6ba9628, 6ba9628, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
Hi @chrjxj, can you try changing "stage": 1, in your config JSON to "stage": 2,? I want to confirm whether your issue occurs with both stages of ZeRO. I am not yet able to reproduce the error on my side.
Actually @chrjxj, can you set these both to false in your config? I suspect this will fix your issue.
"contiguous_gradients": false,
"overlap_comm": false,
@jeffra thanks. It still doesn't work and throws a new error message.
@chrjxj, can you provide the stack trace for the new error message?
Hi @chrjxj, did you find a solution?
@antoiloui, are you also seeing this error? Can you share the deepspeed version you are using and the stack trace? Did you also try turning off contiguous_gradients and overlap_comm?
Hi @jeffra, yes I'm experiencing the same issue. Here is the error I get:
File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
surrogate.backward()
File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
And here is my config file:
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": false,
"contiguous_gradients": false
},
"steps_per_print": 2000,
"wall_clock_breakdown": false
}
Gotcha, I see. Thank you @antoiloui. What version of deepspeed are you running?
Is it possible to provide a repro for this error that you're seeing?
Hi @chrjxj, did you find a solution?
No... I switched to other tasks...
Hasn't this problem been solved? I'm currently facing a similar error. I'm using FusedAdam as the optimizer, so I'm not using the FP16 option, but the error is similar.
Here is the error I get:
Traceback (most recent call last):
File "/root/QuickDraw/train.py", line 244, in <module>
train(opt)
File "/root/QuickDraw/train.py", line 165, in train
torch.autograd.backward(loss)
File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
This is my deepspeed_config file:
{
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu"
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true
},
"steps_per_print": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001
}
}
}
"stage": 2 > "stage":1 Solved
Well, let me join this thread too. I have the same issue as described above.
The code I run can be found here: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py
The configuration I use:
{
"zero_allow_untested_optimizer": True,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": True,
"overlap_comm": True,
"allgather_partitions": True,
"reduce_scatter": True,
"allgather_bucket_size": 200000000,
"reduce_bucket_size": 200000000,
"sub_group_size": 1000000000000,
},
"activation_checkpointing": {
"partition_activations": False,
"cpu_checkpointing": False,
"contiguous_memory_optimization": False,
"synchronize_checkpoint_boundary": False,
},
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"single_submit": False,
"overlap_events": True,
"thread_count": 1,
},
"gradient_clipping": 1.0,
"gradient_accumulation_steps": 1,
"bf16": {"enabled": True},
}
Traceback:
Traceback (most recent call last):
File "train.py", line 367, in <module>
trainer.run(m_cfg, train_dataset, None, tconf)
File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
return self._strategy.launcher.launch(run_method, *args, **kwargs)
File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
return run_method(*args, **kwargs)
File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
run_epoch('train')
File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
self.backward(loss)
File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
self._precision.backward(tensor, module, *args, **kwargs)
File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
tensor.backward(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
Try changing 'stage' from 2 to 1 in the configuration. Do you still have the same problem? I understand that stage 2 improves training efficiency by partitioning more state when training a large model, but in my case switching to stage 1 solved the problem.
The official documentation describes the stage setting as follows:
Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.
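To make the workaround concrete, here is a minimal, illustrative sketch (a Python-dict view of the config, not a verified fix for every setup) of the single change being suggested:
# Illustrative only: switch ZeRO from stage 2 (optimizer + gradient partitioning)
# back to stage 1 (optimizer state partitioning only); all other keys unchanged.
ds_config = {
    "zero_optimization": {
        "stage": 1,  # previously 2
    },
    # ... remaining settings unchanged ...
}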
I solved my problem by choosing the right combination of Python version and package versions. In case anyone is interested:
- python 3.8
- torch==2.0.0
- deepspeed==0.9.1
- pytorch-lightning==1.9.1
As you can see in my traceback, I was running DeepSpeed through the pytorch-lightning interface. I was also playing with some configurations, trying the predefined Lightning configurations like "deepspeed_strategy_2" and "deepspeed_strategy_3", and I got the same error every time, so I guess I just had a version compatibility problem.
This method doesn't solve my problem. I am also studying RWKV. Can you help me? My problem is:
Traceback (most recent call last):
File "/data1/RWKV-LM/RWKV-v4/train.py", line 280, in <module>
trainer.run(m_cfg, train_dataset, None, tconf)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 628, in _run_impl
return self._strategy.launcher.launch(run_method, *args, **kwargs)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 638, in _run_with_setup
return run_function(*args, **kwargs)
File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
run_epoch('train')
File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
self.backward(loss)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 359, in backward
self._precision.backward(tensor, module, *args, **kwargs)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/plugins/precision/precision.py", line 73, in backward
tensor.backward(*args, **kwargs)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
@maomao279 Have you tried v4neo? Also, are you sure you are using the same versions at runtime, and which CUDA version do you use? (Not sure the last one is important, I just want to know.)
I got the same issue, but fixed it by removing a redundant backward call.
outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
# loss.backward() # remove this line
model_engine.backward(loss)
model_engine.step()
And this code came from ChatGPT, so the mistake is excusable.
Has anybody found a solution other than using different package versions or changing to stage 1? Unfortunately I need stage 2 to work and cannot downgrade the package versions due to dependencies. Help would be really appreciated.
Any ideas? I got a similar bug: AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'.
I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().
Thanks for sharing this update. Can you confirm that you were seeing the same error as in the original post?
Also, was your code following this guide for model porting: https://www.deepspeed.ai/getting-started/#writing-deepspeed-models ?
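For reference, this is roughly the loop pattern that guide describes, sketched here with placeholder names (cmd_args, model, and data_loader are assumptions, not part of this thread). Routing backward and step through the engine instead of torch's native calls is what the commenters above report fixes the missing ipg_index:
import deepspeed

# Sketch only: the engine owns the optimizer, loss scaling, and the ZeRO gradient
# buffers, so backward/step should go through it rather than through torch directly.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=cmd_args,                        # placeholder: parsed args incl. --deepspeed_config
    model=model,                          # placeholder: your torch.nn.Module
    model_parameters=model.parameters(),
)

for batch in data_loader:                 # placeholder data loader
    loss = model_engine(batch)            # forward pass through the engine
    model_engine.backward(loss)           # not loss.backward() or scaler.scale(loss).backward()
    model_engine.step()                   # optimizer step and gradient zeroing handled by the engine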