DeepSpeed
                        Activation Checkpointing conflicts with Weight Sharing
Describe the bug
I implement multiple transformer layers with a single set of layer parameters (e.g., one layer is reused six times to construct a 6-layer transformer). When I use activation checkpointing, an AssertionError is raised at line 631 of stage2.py.
To Reproduce
This is the code I use to call checkpointing:
hidden_states = torch.utils.checkpoint.checkpoint(
                custom(l, l + self.checkpoint_num_layers),
                hidden_states, attention_mask, padding_mask, bias_encoder)
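A minimal sketch of the weight-sharing pattern described above, assuming a single TransformerEncoderLayer reused for every "virtual" layer and wrapped in activation checkpointing. The class and attribute names (SharedLayerTransformer, checkpoint_num_layers, _custom) are illustrative, not taken from the reporter's actual code, and the snippet runs in plain PyTorch without DeepSpeed; the assertion reportedly only appears under ZeRO stage 2.

import torch
import torch.nn as nn
import torch.utils.checkpoint as checkpoint


class SharedLayerTransformer(nn.Module):
    """Hypothetical repro: one layer's parameters reused num_layers times."""

    def __init__(self, hidden_size=64, num_layers=6, checkpoint_num_layers=1):
        super().__init__()
        # Only one set of layer parameters; it is applied num_layers times.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, batch_first=True)
        self.num_layers = num_layers
        self.checkpoint_num_layers = checkpoint_num_layers

    def _custom(self, start, end):
        # Returns a function that applies the shared layer (end - start) times.
        def custom_forward(hidden_states):
            for _ in range(start, end):
                hidden_states = self.shared_layer(hidden_states)
            return hidden_states
        return custom_forward

    def forward(self, hidden_states):
        l = 0
        while l < self.num_layers:
            # Each chunk of (reused) layers is activation-checkpointed.
            hidden_states = checkpoint.checkpoint(
                self._custom(l, l + self.checkpoint_num_layers),
                hidden_states)
            l += self.checkpoint_num_layers
        return hidden_states


if __name__ == "__main__":
    model = SharedLayerTransformer()
    x = torch.randn(2, 10, 64, requires_grad=True)
    # Works in plain PyTorch; the reported AssertionError occurs only when
    # the same model is trained with DeepSpeed ZeRO stage 2.
    model(x).sum().backward()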
Expected behavior
I expect the model to run normally.
Unexpected behavior
AssertionError: The parameter 97 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
Additional context
DeepSpeed version: 0.3.16
@iyupan, thanks for reporting this issue.
To help investigate this, can you please provide repro steps?
Also, please clarify the expected behavior in this case. Should each parameter's gradient be accumulated six times in each backward pass?
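For reference, a hedged sketch of the behavior in question: in plain PyTorch (no DeepSpeed), a shared parameter accumulates one gradient contribution per reuse within a single backward pass, so six reuses yield a single summed gradient rather than six separate reductions. The module and sizes below are illustrative only.

import torch
import torch.nn as nn

shared = nn.Linear(4, 4, bias=False)
x = torch.randn(1, 4)

out = x
for _ in range(6):            # reuse the same layer six times
    out = shared(out)
out.sum().backward()

# shared.weight.grad now holds the sum of the six per-use gradients;
# autograd accumulates them before any optimizer or reduction step sees them.
print(shared.weight.grad.shape)  # torch.Size([4, 4])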