
Errors with ZERO2 in the encoder-decoder model

Open JustinLin610 opened this issue 4 years ago • 4 comments

Hi, I tried to implement the encoder-decoder model BART (https://github.com/huggingface/transformers/tree/master/src/transformers/models/bart) following the Megatron-LM tutorial. Everything works fine as long as I do not use ZeRO-2 together with activation checkpointing. With ZeRO-2, training crashes with a "gradient computed twice" error: AssertionError: The parameter 195 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

This is my configuration for ZeRO-2:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "weight_decay": 1e-2
    }
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,  
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 500000000
  },
  "zero_allow_untested_optimizer": true,
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 100000000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true
  },
  "wall_clock_breakdown": true
}
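For context, this config is passed to deepspeed.initialize roughly as follows (a minimal sketch; ds_config.json and model are placeholders, and older DeepSpeed versions take the config path via args.deepspeed_config rather than the config argument):

import deepspeed

# model is the BART encoder-decoder; the JSON above is saved as ds_config.json
# (placeholder filename) and handed to DeepSpeed at initialization time.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)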

I tried to implement checkpointing in both the encoder and the decoder, as shown below:

def custom(start, end):
    def custom_forward(*inputs):
        layers = self.layers[start:end]
        hidden_states = inputs[0]
        for layer in layers:
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            dropout_probability = random.uniform(0, 1)
            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
                attn = None
            else:
                hidden_states, attn = layer(hidden_states, attention_mask, output_attentions=output_attentions)
            if output_attentions:
                all_attentions = all_attentions + (attn,)
        return hidden_states
    return custom_forward

if self.checkpoint_activations:
    l = 0
    num_layers = len(self.layers)
    chunk_length = self.checkpoint_num_layers
    while l < num_layers:
        inputs = [hidden_states]
        hidden_states = checkpoint(custom(l, l + chunk_length), *inputs)
        l += chunk_length

but it seems that DeepSpeed regards the encoder and decoder as two separate models, and the error still occurs.

I would like to know the right way to apply activation checkpointing to an encoder-decoder model. Thanks.
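For context, the checkpoint call above is DeepSpeed's activation checkpointing API, wired up per the Megatron-LM tutorial roughly as follows (a minimal sketch; mpu is Megatron's model-parallel module and args.deepspeed_config is the launcher argument, both assumptions that do not exist as-is in a plain Hugging Face BART setup):

import deepspeed

# Route Megatron-style activation checkpointing through DeepSpeed so that
# partition_activations / cpu_checkpointing from the JSON config take effect.
deepspeed.checkpointing.configure(mpu, deepspeed_config=args.deepspeed_config)
mpu.checkpoint = deepspeed.checkpointing.checkpoint
mpu.get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
mpu.model_parallel_cuda_manual_seed = deepspeed.checkpointing.model_parallel_cuda_manual_seed

# The custom(l, l + chunk_length) wrapper in the snippet above then runs under
# deepspeed.checkpointing.checkpoint instead of torch.utils.checkpoint.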

JustinLin610 avatar Jan 04 '21 03:01 JustinLin610

Has anybody solved this?

petrgeiger-incieve avatar Nov 09 '21 13:11 petrgeiger-incieve

Ran into the same issue with a normal decoder-only model.

congchan avatar Mar 17 '23 06:03 congchan

Ran into the same issue with a custom regularization term added to the loss: loss = loss + custom_regular_terms(some params of the model)

CheungZeeCn avatar May 07 '23 13:05 CheungZeeCn

Same issue here. @JustinLin610, have you solved this?

uygnef avatar May 08 '23 02:05 uygnef

Ran into the same issue with a custom regularization term added to the loss: loss = loss + custom_regular_terms(some params of the model)

I am facing the same problem. Have you solved this?

liuchengyuan123 avatar Nov 24 '23 03:11 liuchengyuan123

Changing to ZeRO-1 solved this for me.

liuchengyuan123 avatar Dec 13 '23 09:12 liuchengyuan123

Ran into the same issue with a custom regularization term added to the loss: loss = loss + custom_regular_terms(some params of the model)

Changing to ZeRO-1 solved this for me.

I have the same situation. I first used ZeRO-3, but unluckily found the weights were empty because of the partitioned parameters; even after gathering the weights onto one GPU, I guess the error happened in the gradient backward pass. So I switched to ZeRO-2 and ran into the unsupported multiple gradient reduction. Sad... maybe I need to try ZeRO-1.
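(For reference, partitioned ZeRO-3 parameters show up as empty tensors outside the forward/backward pass and have to be gathered before they can be read; a minimal sketch, with model.lm_head.weight as a placeholder parameter:)

import deepspeed

# Under ZeRO stage 3 each parameter is partitioned across ranks, so its tensor
# is empty (numel() == 0) outside forward/backward. Gather it before reading;
# model.lm_head.weight is just a placeholder for whichever weight is needed.
with deepspeed.zero.GatheredParameters(model.lm_head.weight, modifier_rank=None):
    full_weight = model.lm_head.weight.detach().clone()  # full shape inside the context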

zouyingcao avatar Jan 19 '24 11:01 zouyingcao

I am facing the same problem and am eager to see the proposed solutions.

patrick-tssn avatar Mar 20 '24 02:03 patrick-tssn

Same here. The model is OpenAssistant/reward-model-deberta-v3-large-v2.

chongxiaoc avatar Mar 20 '24 04:03 chongxiaoc

Please explicitly set use_reentrant=False in the torch checkpointing function and that should solve the issue. https://pytorch.org/docs/stable/checkpoint.html
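A minimal sketch of that change, reusing the custom_forward and hidden_states names from the snippet earlier in this thread as placeholders:

from torch.utils.checkpoint import checkpoint

# Non-reentrant checkpointing (use_reentrant=False) tracks saved tensors with
# autograd hooks instead of replaying backward through a re-entrant autograd
# call, which avoids registering a second gradient for the same ZeRO-2 partition.
hidden_states = checkpoint(custom_forward, hidden_states, use_reentrant=False)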

jomayeri avatar Mar 22 '24 17:03 jomayeri

Please explicitly set use_reentrant=False in the torch checkpointing function and that should solve the issue. https://pytorch.org/docs/stable/checkpoint.html

@jomayeri That doesn't work in my situation. I am working with the Transformers Trainer, passing gradient_checkpointing_kwargs like this:

training_args = TrainingArguments(
    # ... other arguments ...
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},  # also tried {'use_reentrant': True}
    # ... other arguments ...
)
trainer = Trainer(args=training_args)

and it raises a warning: You are using an old version of the checkpointing format that is deprecated (we will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
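For reference, in recent transformers versions the Trainer path above boils down to roughly the following call on the model (a sketch; model stands for the loaded model), and the warning means those kwargs are dropped because the model class still defines the legacy hook:

# What Trainer(gradient_checkpointing=True, gradient_checkpointing_kwargs=...)
# ends up calling in recent transformers versions; `model` is a placeholder.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

# The deprecation warning fires when the model class still defines the legacy
# hook, roughly of this shape; per the warning text, removing it lets the
# kwargs above take effect:
#
#     def _set_gradient_checkpointing(self, module, value=False):
#         if isinstance(module, SomeEncoderLayer):   # placeholder layer class
#             module.gradient_checkpointing = value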

Also, changing ZeRO-2 to ZeRO-1 does not help either, despite @liuchengyuan123's solution.

Any other suggestions on how to fix this? @jomayeri

iFe1er avatar Apr 01 '24 13:04 iFe1er