DeepSpeed
Errors with ZeRO-2 in the encoder-decoder model
Hi, I tried to implement activation checkpointing for an encoder-decoder model, namely BART (https://github.com/huggingface/transformers/tree/master/src/transformers/models/bart), following the Megatron-LM tutorial. Everything works fine as long as I do not use ZeRO-2 together with checkpointing. With ZeRO-2, the model crashes with a "gradient computed twice" error:
AssertionError: The parameter 195 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
This is my configuration for ZeRO-2:
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "weight_decay": 1e-2
    }
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 500000000
  },
  "zero_allow_untested_optimizer": true,
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 100000000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true
  },
  "wall_clock_breakdown": true
}
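For reference, a minimal sketch of how a config like this is typically wired up, assuming it is saved as ds_config.json and model is the BART module (both names are placeholders):

import deepspeed

# Hypothetical setup: the JSON above saved as ds_config.json, `model` is the BART model.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)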
I tried to implement checkpointing in both the encoder and decoder, as shown below
def custom(start, end):
    def custom_forward(*inputs):
        layers = self.layers[start:end]
        hidden_states = inputs[0]
        for layer in layers:
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            dropout_probability = random.uniform(0, 1)
            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
                attn = None
            else:
                hidden_states, attn = layer(hidden_states, attention_mask, output_attentions=output_attentions)
            if output_attentions:
                all_attentions = all_attentions + (attn,)
        return hidden_states
    return custom_forward

if self.checkpoint_activations:
    l = 0
    num_layers = len(self.layers)
    chunk_length = self.checkpoint_num_layers
    while l < num_layers:
        inputs = [hidden_states]
        hidden_states = checkpoint(custom(l, l + chunk_length), *inputs)
        l += chunk_length
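(For context, in the Megatron-LM tutorial the checkpoint used above is DeepSpeed's activation-checkpointing function; a minimal sketch of how it would be configured, assuming no model-parallel mpu and the config file above, is:)

import deepspeed

# Assumption: no model-parallel unit (mpu=None) and the activation_checkpointing section from the JSON above.
deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")
checkpoint = deepspeed.checkpointing.checkpoint  # the checkpoint() called in the loop above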
However, it seems that the system regards the encoder and decoder as two separate models, and the error still happens.
I would like to know the right way to checkpoint an encoder-decoder model. Thanks.
Has anybody solved this?
Ran into the same issue with a plain decoder-only model.
Ran into the same issue with a loss of the form loss = loss + custom_regular_terms(some params of the model).
Same issue here. @JustinLin610, have you solved this?
I am facing the same problem. Have you solved this?
Changing to ZeRO-1 solved this for me.
I have the same situation. I first used ZeRO-3, but unfortunately the weights came up empty because of the partitioned parameters; even after safely gathering the weights onto one GPU, I still hit what I guess is an error in the gradient backward pass. So I switched to ZeRO-2 and ran into the unsupported multiple gradient reduction. Sad... maybe I need to try ZeRO-1.
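(For context, "gathering the weights onto one GPU" under ZeRO-3 is usually done with deepspeed.zero.GatheredParameters; a minimal sketch, where model.lm_head.weight is only a placeholder parameter:)

import deepspeed
import torch.distributed as dist

# Assumption: `model` is wrapped with ZeRO stage 3; lm_head.weight is a placeholder parameter.
with deepspeed.zero.GatheredParameters(model.lm_head.weight, modifier_rank=0):
    if dist.get_rank() == 0:
        # the full, un-partitioned weight is only materialized inside this context
        print(model.lm_head.weight.shape)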
I am facing the same problem and am eager to see the proposed solutions.
Same here. The model is OpenAssistant/reward-model-deberta-v3-large-v2.
Please explicitly set use_reentrant=False in the torch checkpointing function and that should solve the issue. https://pytorch.org/docs/stable/checkpoint.html
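A minimal sketch of what that looks like with torch.utils.checkpoint (block and hidden_states are placeholders for one layer and its input):

from torch.utils.checkpoint import checkpoint

# Assumption: `block` is one transformer layer and `hidden_states` its input (placeholders).
hidden_states = checkpoint(block, hidden_states, use_reentrant=False)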
@jomayeri That doesn't work in my situation. I am working with the Transformers Trainer, passing gradient_checkpointing_kwargs like
training_args = TrainingArguments(
    # Arguments
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},  # OR gradient_checkpointing_kwargs={'use_reentrant': True}
    # Arguments
)
trainer = Trainer(args=training_args)
It raises a warning:
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Also, changing ZeRO-2 to ZeRO-1 does not help either, despite @liuchengyuan123's solution.
Any other suggestions on how to fix this? @jomayeri