DeepSpeed
                        Activation Checkpointing conflicts with Weight Sharing
Describe the bug
I implement multiple transformer layers with a single set of layer parameters (e.g., one layer is reused six times to construct a 6-layer transformer). When I use activation checkpointing, an AssertionError is raised at line 631 of stage2.py.
To Reproduce
This is the code I use to call checkpointing:
hidden_states = torch.utils.checkpoint.checkpoint(
                custom(l, l + self.checkpoint_num_layers),
                hidden_states, attention_mask, padding_mask, bias_encoder)
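A minimal sketch of the weight-sharing pattern described above, assuming a single TransformerEncoderLayer reused for every "virtual" layer and wrapped in activation checkpointing. The class and attribute names (SharedLayerTransformer, checkpoint_num_layers, _custom) are illustrative, not taken from the reporter's actual code, and the snippet runs in plain PyTorch without DeepSpeed; the assertion reportedly only appears under ZeRO stage 2.

import torch
import torch.nn as nn
import torch.utils.checkpoint as checkpoint


class SharedLayerTransformer(nn.Module):
    """Hypothetical repro: one layer's parameters reused num_layers times."""

    def __init__(self, hidden_size=64, num_layers=6, checkpoint_num_layers=1):
        super().__init__()
        # Only one set of layer parameters; it is applied num_layers times.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, batch_first=True)
        self.num_layers = num_layers
        self.checkpoint_num_layers = checkpoint_num_layers

    def _custom(self, start, end):
        # Returns a function that applies the shared layer (end - start) times.
        def custom_forward(hidden_states):
            for _ in range(start, end):
                hidden_states = self.shared_layer(hidden_states)
            return hidden_states
        return custom_forward

    def forward(self, hidden_states):
        l = 0
        while l < self.num_layers:
            # Each chunk of (reused) layers is activation-checkpointed.
            hidden_states = checkpoint.checkpoint(
                self._custom(l, l + self.checkpoint_num_layers),
                hidden_states)
            l += self.checkpoint_num_layers
        return hidden_states


if __name__ == "__main__":
    model = SharedLayerTransformer()
    x = torch.randn(2, 10, 64, requires_grad=True)
    # Works in plain PyTorch; the reported AssertionError occurs only when
    # the same model is trained with DeepSpeed ZeRO stage 2.
    model(x).sum().backward()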
Expected behavior
I expect the model to run normally.
Unexpected behavior
AssertionError: The parameter 97 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
Additional context
DeepSpeed version: 0.3.16
@iyupan, thanks for reporting this issue.
To help investigate this, can you please provide repro steps?
Also, please clarify the expected behavior in this case. Should each parameter's gradient be accumulated six times in each backward pass?
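For reference, a hedged sketch of the behavior in question: in plain PyTorch (no DeepSpeed), a shared parameter accumulates one gradient contribution per reuse within a single backward pass, so six reuses yield a single summed gradient rather than six separate reductions. The module and sizes below are illustrative only.

import torch
import torch.nn as nn

shared = nn.Linear(4, 4, bias=False)
x = torch.randn(1, 4)

out = x
for _ in range(6):            # reuse the same layer six times
    out = shared(out)
out.sum().backward()

# shared.weight.grad now holds the sum of the six per-use gradients;
# autograd accumulates them before any optimizer or reduction step sees them.
print(shared.weight.grad.shape)  # torch.Size([4, 4])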