[hotfix] fix parameter shape checking
📌 Checklist before creating the PR
- [ ] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
- [ ] I have added relevant tags if possible for us to better distinguish different PRs
- [ ] I have installed pre-commit: `pip install pre-commit && pre-commit install`
🚨 Issue number
Link this PR to your issue with words like fixed to automatically close the linked issue upon merge, e.g. `fixed #1234`, `closed #1234`, `resolved #1234`
📝 What does this PR do?
Summarize your work here. If you have any plots/diagrams/screenshots/tables, please attach them here.
💥 Checklist before requesting a review
- [ ] I have linked my PR to an issue (instruction)
- [ ] My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
- [ ] I have performed a self-review of my code
- [ ] I have added thorough tests.
- [ ] I have added docstrings for all the functions/methods I implemented
⭐️ Do you enjoy contributing to Colossal-AI?
- [ ] 🌝 Yes, I do.
- [ ] 🌚 No, I don't.
Tell us more if you don't enjoy contributing to Colossal-AI.
Why is `make_vocab_size_divisible_by` 64? When `tensor_parallel_size = 8` and `num_embeddings = 128256`, then `multiple = make_vocab_size_divisible_by * tensor_parallel_size = 64 * 8 = 512` and `num_embeddings % multiple = 128256 % 512 = 256 != 0`, so `num_embeddings` is silently padded and the resulting parameter shape is wrong!
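For concreteness, here is the padding arithmetic in isolation (a minimal sketch; the numbers follow the example above):

```python
make_vocab_size_divisible_by = 64
tensor_parallel_size = 8
num_embeddings = 128256  # e.g. the Llama-3 vocabulary size

multiple = make_vocab_size_divisible_by * tensor_parallel_size  # 64 * 8 = 512
remainder = num_embeddings % multiple                           # 128256 % 512 = 256
padded = num_embeddings + multiple - remainder                  # 128512, no longer 128256
```

So the constructor silently grows the vocabulary from 128256 to 128512, which is where the parameter shape mismatch comes from.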
```python
# Imports needed by this excerpt of colossalai/shardformer/layer/embedding.py
from typing import Callable, Optional

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed import ProcessGroup

from colossalai.nn import init
from colossalai.shardformer.layer.parallel_module import PaddingParallelModule


class VocabParallelEmbedding1D(PaddingParallelModule):
    r"""Embedding parallelized in the vocabulary dimension.

    Args:
        num_embeddings (int): number of embeddings.
        embedding_dim (int): dimension of embedding.
        padding_idx (int, optional): If specified, the entries at padding_idx do not contribute to the gradient;
            therefore, the embedding vector at padding_idx is not updated during training,
            i.e. it remains as a fixed “pad”, defaults to None.
        dtype (:class:`torch.dtype`, optional): The dtype of parameters, defaults to None.
        weight_initializer (:class:`typing.Callable`, optional):
            The initializer of weight, defaults to normal initializer.

    The ``args`` and ``kwargs`` used in :func:`torch.nn.functional.embedding` should contain:
    ::

        max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is
            renormalized to have norm max_norm. Note: this will modify weight in-place.
        norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
        scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse
            of frequency of the words in the mini-batch. Default False.
        sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.

    More details about ``args`` and ``kwargs`` could be found in
    `Embedding <https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html#torch.nn.functional.embedding>`_.

    For more details about the initializer, please refer to
    `init <https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/nn/init.py>`_.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        padding_idx: int = None,
        dtype: torch.dtype = None,
        device: torch.device = None,
        process_group: ProcessGroup = None,
        weight: Optional[nn.Parameter] = None,
        weight_initializer: Callable = init.normal_(),
        make_vocab_size_divisible_by: int = 64,
        fp8_communication: bool = False,
        *args,
        **kwargs,
    ):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.embed_args = args
        self.embed_kwargs = kwargs
        self.process_group = process_group
        self.fp8_communication = fp8_communication

        tensor_parallel_size = dist.get_world_size(group=process_group)
        tensor_parallel_rank = dist.get_rank(group=process_group)

        # generate weight and bias
        if weight is None:
            factory_kwargs = {"device": device, "dtype": dtype}
            weight = nn.Parameter(torch.empty((num_embeddings, self.embedding_dim), **factory_kwargs))
        else:
            weight.data = weight.data.to(device=device, dtype=dtype)

        # calculate new padding size: pad the vocabulary up to the next multiple of
        # make_vocab_size_divisible_by * tensor_parallel_size
        multiple = make_vocab_size_divisible_by * tensor_parallel_size
        if num_embeddings % multiple != 0:
            self.num_embeddings = num_embeddings + multiple - (num_embeddings % multiple)
        # ... (remainder of __init__ elided)
```
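After padding, each tensor-parallel rank holds an equal slice of the padded vocabulary. A minimal sketch of that split, assuming the usual Megatron-style vocab partitioning (the variable names are illustrative, not from the source file):

```python
padded_vocab_size = 128512  # from the example above
tensor_parallel_size = 8

per_partition = padded_vocab_size // tensor_parallel_size  # 16064 rows per rank
for rank in range(tensor_parallel_size):
    vocab_start = rank * per_partition
    vocab_end = vocab_start + per_partition
    print(rank, vocab_start, vocab_end)  # rank 0: [0, 16064), ..., rank 7: [112448, 128512)
```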
Propose to revise the following code: https://github.com/hpcaitech/ColossalAI/blob/a2596519fd8f25da13da622e5188b9a18024f3c0/colossalai/shardformer/layer/embedding.py, line 186 and line 304: insert a `raise ValueError`, and delete `self.num_embeddings = (num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by))`:

```python
if num_embeddings % make_vocab_size_divisible_by != 0:
    raise ValueError
    # self.num_embeddings = (
    #     num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by)
    # )
```
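To make the proposal concrete, a hypothetical standalone version of the check is sketched below; the helper name and error message are illustrative, not the actual ColossalAI wording. The layer would fail fast instead of silently padding:

```python
def check_vocab_size(num_embeddings: int, make_vocab_size_divisible_by: int = 64) -> None:
    # Hypothetical helper: reject indivisible vocabulary sizes up front.
    if num_embeddings % make_vocab_size_divisible_by != 0:
        raise ValueError(
            f"num_embeddings ({num_embeddings}) is not divisible by "
            f"make_vocab_size_divisible_by ({make_vocab_size_divisible_by})"
        )

check_vocab_size(50304)  # passes: 50304 % 64 == 0

try:
    check_vocab_size(50257)  # 50257 % 64 == 17
except ValueError as e:
    print(e)
```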
The `make_vocab_size_divisible_by` parameter is used to ensure that the vocabulary size is divisible by 64. This is because `torch.mm` will select a faster operator when the size is a multiple of 64. Later, the result will be unpadded.
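The pad-then-unpad flow this describes can be illustrated in a few lines (a sketch under the assumption that the padded logit columns are simply sliced off after the matmul; the tensor names and the small hidden size are illustrative):

```python
import torch

vocab_size, padded_vocab_size, hidden = 128256, 128512, 64  # small hidden size for the demo

hidden_states = torch.randn(8, hidden)                   # (batch, hidden)
lm_head_weight = torch.randn(padded_vocab_size, hidden)  # weight padded to a multiple of 64

logits_padded = torch.mm(hidden_states, lm_head_weight.t())  # (8, 128512), the fast shape
logits = logits_padded[:, :vocab_size]                       # unpad: drop the 256 padded columns
```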
This issue has been closed. Could you please take a look at #6110?