ColossalAI
[BUG]: Parameters are not updated under tensor parallelism
🐛 Describe the bug
Hello there,
Thanks for this awesome project.
I am currently training a GPT2 model for contrastive learning with the InfoNCE loss using tensor parallelism. To implement the training codebase, I followed the GPT2_Gemini example.
However, I ran into an issue when using tensor parallelism with a degree of 2: the parameters were not updated. When I switched to a degree of 1 with data parallelism only, the parameters were updated successfully and the loss decreased significantly.
Can anyone help me figure out how to fix this issue? Big thanks!
I calculate the InfoNCE loss with the following code:
import torch
import torch.nn.functional as F

def calculate_in_batch_contrastive_loss(self, x):
    # Gather the embeddings from every rank so the in-batch negatives
    # cover the global batch.
    x = torch.cat(GatherLayer.apply(x), dim=0)
    # Pairwise cosine-similarity matrix, scaled by the temperature factor.
    query_x = x.unsqueeze(0)
    key_x = x.unsqueeze(1)
    cos_similarity = F.cosine_similarity(query_x, key_x, -1) * self.temperature
    # Each embedding's positive is itself, i.e. the diagonal of the matrix.
    labels = torch.arange(cos_similarity.size(0)).cuda()
    loss = F.cross_entropy(cos_similarity, labels)
    return loss
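For reference, the same InfoNCE construction on a single process (without the gather) looks like this; the batch size, feature dimension, and the scale 20.0 are placeholder values standing in for self.temperature:

import torch
import torch.nn.functional as F

# Toy example: every embedding's positive is itself, so the targets are
# the diagonal indices of the pairwise similarity matrix.
x = torch.randn(8, 128)                                               # 8 embeddings
sim = F.cosine_similarity(x.unsqueeze(0), x.unsqueeze(1), -1) * 20.0  # (8, 8)
labels = torch.arange(sim.size(0))
loss = F.cross_entropy(sim, labels)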
To gather the embeddings from each process before computing the InfoNCE loss, I use this GatherLayer:
import torch.distributed as dist

class GatherLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # All-gather the local tensor from every rank.
        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(output, input)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        input, = ctx.saved_tensors
        # Keep only the gradient slice that corresponds to this rank's input.
        grad_out = torch.zeros_like(input)
        grad_out[:] = grads[dist.get_rank()]
        return grad_out
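A small standalone script like the following can be used to sanity-check that gradients flow through GatherLayer at all. The file name check_gather.py is made up; it assumes the GatherLayer class above is defined in the same file and is launched with torchrun --nproc_per_node=2 check_gather.py:

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    # Each rank contributes a small local tensor that requires grad.
    x = torch.randn(4, 8, device="cuda", requires_grad=True)
    gathered = torch.cat(GatherLayer.apply(x), dim=0)
    gathered.sum().backward()
    print(f"rank {dist.get_rank()}: grad sum = {x.grad.sum().item():.4f}")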
Environment
[GPU]
RTX 3090
RTX 4090
[CUDA]
CUDA == 11.6
[Python package]
colossalai == 0.2.7
torch == 1.13.1
Hey, how did you write your tensor_parallelize function if you followed our gpt2 example?
There is a tensor_parallelize function in the GPT example; when would people need to implement their own tensor_parallelize? @JThh
@JThh Hi, thanks for your response!
I follow the tensor_parallelize function in the example because I also use the same GPT2 model (Hugging Face version).
Were you able to successfully update the parameters and decrease the loss with the example code?
Hi @eric8607242, I guess the reason is that if a tensor is all-gathered in the forward pass, its gradient should be reduce-scattered rather than simply sliced.
Hi @kurisusnowdeng, Thanks for your response. I will try to address the issue in this direction!
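For anyone hitting the same issue, here is a minimal sketch of that suggestion (the class name GatherLayerWithReducedGrad is made up, and torch.distributed is assumed to be initialized): the backward sums the gradient contributions from all ranks before taking the local slice, which matches the reduce-scatter behaviour described above.

import torch
import torch.distributed as dist

class GatherLayerWithReducedGrad(torch.autograd.Function):
    # All-gather in the forward pass; reduce the gathered gradients
    # in the backward pass before slicing out the local part.

    @staticmethod
    def forward(ctx, input):
        ctx.rank = dist.get_rank()
        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(output, input)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        # Sum the gradient contributions from every rank, then return this
        # rank's slice, which is equivalent to reduce-scattering the stacked grads.
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)
        return all_grads[ctx.rank]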