Megatron-LM icon indicating copy to clipboard operation
Megatron-LM copied to clipboard

[QUESTION] Adding a new parameter in ColumnParallelLinear/RowParallelLinear raises Error

Open haolibai opened this issue 5 months ago • 0 comments

Hi, I am trying to add some new learnable parameters inside ColumnParallelLinear/RowParallelLinear, and the following is an example code snippet:

class ColumnParallelLinear(torch.nn.Module):
    """Linear layer with column parallelism.

    The linear layer is defined as Y = XA + b. A is parallelized along
    its second dimension as A = [A_1, ..., A_p].

    Args:
        input_size: first dimension of matrix A.
        output_size: second dimension of matrix A.
        bias: If true, add bias
        gather_output: If true, call all-gather on output and make Y available to all GPUs, otherwise, every GPU will have its output which is Y_i = XA_i
        init_method: method to initialize weights. Note that bias is always set to zero.
        stride: For the strided linear layers.
        keep_master_weight_for_test: This was added for testing and should be set to False. It returns the master weights used for initialization.
        skip_bias_add: If True, do not add the bias term, instead return it to be added by the caller. This enables performance optimations where bias can be fused with other elementwise operations.
        skip_weight_param_allocation: If True, weight parameter is not allocated and must be passed as a keyword argument `weight` during the forward pass. Note that this does not affect bias, which will be allocated if bias is True. Defaults to False.
        embedding_activation_buffer: This buffer holds the input activations of the final embedding linear layer on the last pipeline stage when defer_embedding_wgrad_compute is enabled.
        grad_output_buffer: This buffer holds the gradient outputs of the final embedding linear layer on the last pipeline stage when defer_embedding_wgrad_compute is enabled.
        is_expert: If True, the layer is treated as an MoE expert layer.
        config: ModelParallelConfig object
        tp_comm_buffer_name: Communication buffer name is not used in non-Transformer-Engine modules.
    """

    def __init__(
        self,
        input_size,
        output_size,
        *,
        config: ModelParallelConfig,
        init_method: Callable,
        bias=True,
        gather_output=False,
        stride=1,
        keep_master_weight_for_test=False,
        skip_bias_add=False,
        skip_weight_param_allocation: bool = False,
        embedding_activation_buffer: Optional[List[torch.Tensor]] = None,
        grad_output_buffer: Optional[List[torch.Tensor]] = None,
        is_expert: bool = False,
        tp_comm_buffer_name: str = None,  # Not used
    ):
        super(ColumnParallelLinear, self).__init__()
        ...
        # NOTE: a new parameter defined here (for example)
        self.new_param = Parameter(torch.randn(config.hidden_size, dtype=config.params_dtype, device=torch.cuda.current_device()))
    ))

    def forward(self, input_: torch.Tensor, weight: Optional[torch.Tensor] = None):
        ...
        output = output * self.new_param
        return output

However, this gives me the following error during training.

Traceback (most recent call last):
  File "/home/ma-user/work/haoli/code/PanGu/pretrain_gpt.py", line 343, in main
    pretrain(train_valid_test_datasets_provider,
  File "/home/ma-user/work/haoli/code/PanGu/pangu/training/training.py", line 271, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ma-user/work/haoli/code/PanGu/pangu/training/training.py", line 441, in train
    train_step(forward_step_func,
  File "/home/ma-user/work/haoli/code/third_party/Megatron-LM/megatron/training/training.py", line 553, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ma-user/work/haoli/code/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 395, in forward_backward_no_pipelining
    config.finalize_model_grads_func([model])
  File "/home/ma-user/work/haoli/code/third_party/Megatron-LM/megatron/core/distributed/finalize_model_grads.py", line 135, in finalize_model_grads
    model_chunk.finish_grad_sync()
  File "/home/ma-user/work/haoli/code/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 242, in finish_grad_sync
    buffer.finish_grad_sync()
  File "/home/ma-user/work/haoli/code/third_party/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 512, in finish_grad_sync
    bucket.finish_grad_sync()
Traceback (most recent call last):
  File "/home/ma-user/work/haoli/code/pangu_sophon_pytorch/third_party/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 157, in finish_grad_sync
    assert self.communication_handle is not None and self.communication_issued, (
AssertionError: Communication call has not been issued for this bucket (1/2 params have grad available)

It seems the newly added parameter is not counted into self.params_with_grad.

However, the training goes normal when I do the same procedure in other places, e.g., the init fucntion of ParallelAttention, or ParallelMLP, with no such errors.

haolibai avatar Sep 19 '24 13:09 haolibai