[hotfix] fix parameter shape checking
📌 Checklist before creating the PR
- [ ] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
- [ ] I have added relevant tags if possible for us to better distinguish different PRs
- [ ] I have installed pre-commit: `pip install pre-commit && pre-commit install`
🚨 Issue number
Link this PR to your issue with words like fixed to automatically close the linked issue upon merge, e.g. `fixed #1234`, `closed #1234`, `resolved #1234`
📝 What does this PR do?
Summarize your work here. If you have any plots/diagrams/screenshots/tables, please attach them here.
💥 Checklist before requesting a review
- [ ] I have linked my PR to an issue (instruction)
- [ ] My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
- [ ] I have performed a self-review of my code
- [ ] I have added thorough tests.
- [ ] I have added docstrings for all the functions/methods I implemented
⭐️ Do you enjoy contributing to Colossal-AI?
- [ ] 🌝 Yes, I do.
- [ ] 🌚 No, I don't.
Tell us more if you don't enjoy contributing to Colossal-AI.
Why is `make_vocab_size_divisible_by` 64? When `tensor_parallel_size = 8` and `num_embeddings = 128256`, then `multiple = make_vocab_size_divisible_by * tensor_parallel_size = 64 * 8 = 512` and `num_embeddings % multiple = 128256 % 512 = 256 != 0`, so `num_embeddings` is silently padded and the resulting parameter shape is wrong!
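For concreteness, here is the padding arithmetic in isolation (a minimal sketch; the numbers follow the example above):

```python
make_vocab_size_divisible_by = 64
tensor_parallel_size = 8
num_embeddings = 128256  # e.g. the Llama-3 vocabulary size

multiple = make_vocab_size_divisible_by * tensor_parallel_size  # 64 * 8 = 512
remainder = num_embeddings % multiple                           # 128256 % 512 = 256
padded = num_embeddings + multiple - remainder                  # 128512, no longer 128256
```

So the constructor silently grows the vocabulary from 128256 to 128512, which is where the parameter shape mismatch comes from.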
```python
# Imports needed by this excerpt of colossalai/shardformer/layer/embedding.py
from typing import Callable, Optional

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed import ProcessGroup

from colossalai.nn import init
from colossalai.shardformer.layer.parallel_module import PaddingParallelModule


class VocabParallelEmbedding1D(PaddingParallelModule):
    r"""Embedding parallelized in the vocabulary dimension.

    Args:
        num_embeddings (int): number of embeddings.
        embedding_dim (int): dimension of embedding.
        padding_idx (int, optional): If specified, the entries at padding_idx do not contribute to the gradient;
            therefore, the embedding vector at padding_idx is not updated during training,
            i.e. it remains as a fixed “pad”, defaults to None.
        dtype (:class:`torch.dtype`, optional): The dtype of parameters, defaults to None.
        weight_initializer (:class:`typing.Callable`, optional):
            The initializer of weight, defaults to normal initializer.

    The ``args`` and ``kwargs`` used in :func:`torch.nn.functional.embedding` should contain:
    ::

        max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is
            renormalized to have norm max_norm. Note: this will modify weight in-place.
        norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
        scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse
            of frequency of the words in the mini-batch. Default False.
        sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.

    More details about ``args`` and ``kwargs`` could be found in
    `Embedding <https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html#torch.nn.functional.embedding>`_.

    For more details about the initializer, please refer to
    `init <https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/nn/init.py>`_.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        padding_idx: int = None,
        dtype: torch.dtype = None,
        device: torch.device = None,
        process_group: ProcessGroup = None,
        weight: Optional[nn.Parameter] = None,
        weight_initializer: Callable = init.normal_(),
        make_vocab_size_divisible_by: int = 64,
        fp8_communication: bool = False,
        *args,
        **kwargs,
    ):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.embed_args = args
        self.embed_kwargs = kwargs
        self.process_group = process_group
        self.fp8_communication = fp8_communication

        tensor_parallel_size = dist.get_world_size(group=process_group)
        tensor_parallel_rank = dist.get_rank(group=process_group)

        # generate weight and bias
        if weight is None:
            factory_kwargs = {"device": device, "dtype": dtype}
            weight = nn.Parameter(torch.empty((num_embeddings, self.embedding_dim), **factory_kwargs))
        else:
            weight.data = weight.data.to(device=device, dtype=dtype)

        # calculate new padding size: pad the vocabulary up to the next multiple of
        # make_vocab_size_divisible_by * tensor_parallel_size
        multiple = make_vocab_size_divisible_by * tensor_parallel_size
        if num_embeddings % multiple != 0:
            self.num_embeddings = num_embeddings + multiple - (num_embeddings % multiple)
        # ... (remainder of __init__ elided)
```
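After padding, each tensor-parallel rank holds an equal slice of the padded vocabulary. A minimal sketch of that split, assuming the usual Megatron-style vocab partitioning (the variable names are illustrative, not from the source file):

```python
padded_vocab_size = 128512  # from the example above
tensor_parallel_size = 8

per_partition = padded_vocab_size // tensor_parallel_size  # 16064 rows per rank
for rank in range(tensor_parallel_size):
    vocab_start = rank * per_partition
    vocab_end = vocab_start + per_partition
    print(rank, vocab_start, vocab_end)  # rank 0: [0, 16064), ..., rank 7: [112448, 128512)
```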
Propose to revise the following code: https://github.com/hpcaitech/ColossalAI/blob/a2596519fd8f25da13da622e5188b9a18024f3c0/colossalai/shardformer/layer/embedding.py, line 186 and line 304: insert a `raise ValueError`, and delete `self.num_embeddings = (num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by))`:

```python
if num_embeddings % make_vocab_size_divisible_by != 0:
    raise ValueError
    # self.num_embeddings = (
    #     num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by)
    # )
```
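To make the proposal concrete, a hypothetical standalone version of the check is sketched below; the helper name and error message are illustrative, not the actual ColossalAI wording. The layer would fail fast instead of silently padding:

```python
def check_vocab_size(num_embeddings: int, make_vocab_size_divisible_by: int = 64) -> None:
    # Hypothetical helper: reject indivisible vocabulary sizes up front.
    if num_embeddings % make_vocab_size_divisible_by != 0:
        raise ValueError(
            f"num_embeddings ({num_embeddings}) is not divisible by "
            f"make_vocab_size_divisible_by ({make_vocab_size_divisible_by})"
        )

check_vocab_size(50304)  # passes: 50304 % 64 == 0

try:
    check_vocab_size(50257)  # 50257 % 64 == 17
except ValueError as e:
    print(e)
```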
The `make_vocab_size_divisible_by` parameter is used to ensure that the vocabulary size is divisible by 64. This is because `torch.mm` will select a faster operator when the size is a multiple of 64. Later, the result will be unpadded.
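The pad-then-unpad flow this describes can be illustrated in a few lines (a sketch under the assumption that the padded logit columns are simply sliced off after the matmul; the tensor names and the small hidden size are illustrative):

```python
import torch

vocab_size, padded_vocab_size, hidden = 128256, 128512, 64  # small hidden size for the demo

hidden_states = torch.randn(8, hidden)                   # (batch, hidden)
lm_head_weight = torch.randn(padded_vocab_size, hidden)  # weight padded to a multiple of 64

logits_padded = torch.mm(hidden_states, lm_head_weight.t())  # (8, 128512), the fast shape
logits = logits_padded[:, :vocab_size]                       # unpad: drop the 256 padded columns
```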
This issue has been closed. Could you please take a look at #6110?