Gu Wei
Why does class LlamaRMSNorm do `hidden_states = hidden_states.to(torch.float32)`? Why not follow the type promotion rules of PyTorch ops?
`self.weight` is bf16 and `hidden_states` is fp32. I found that the dtypes produced by these two methods are different. Method 1: `return (self.weight * hidden_states).to(input_dtype)  # (bf16 * fp32).to(input_dtype)` Method 2: `return self.weight...`
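A small sketch of the promotion behavior being asked about. This assumes method 2 casts `hidden_states` back to the input dtype before the multiply (the post above is truncated, so that is a guess); the point is that bf16 * fp32 promotes to fp32, so method 1 multiplies in fp32 and downcasts once, while a cast-first variant multiplies entirely in bf16:

```python
import torch

w = torch.full((4,), 1.5, dtype=torch.bfloat16)  # stands in for self.weight (bf16)
h = torch.randn(4, dtype=torch.float32)          # hidden_states after the fp32 variance step
input_dtype = torch.bfloat16

# PyTorch type promotion: bf16 * fp32 -> fp32
prod = w * h
print(prod.dtype)  # torch.float32

# Method 1: multiply in fp32, then downcast the result once.
out1 = prod.to(input_dtype)

# Hypothetical method 2: downcast first, then multiply entirely in bf16.
out2 = w * h.to(input_dtype)

# Both end up bf16, but the rounding happens at different points,
# so the values can differ in the last bits.
print(out1.dtype, out2.dtype)
```

So the explicit `.to(torch.float32)` is not redundant with promotion: it controls *where* in the computation the precision is kept, not just the final dtype.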
I had the same problem and was very confused
Is this a bug or incorrect usage on my side? I didn't find any relevant guidance in the community documentation.
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L3737

> // This means there is not yet a NCCL collective being called
> // Here we have to use the best guesses and will use a single GPU to call...
@mal @ezyang This minimal case is a real scenario extracted from training a model. Does the comment in the code mean it is being used incorrectly?
> The reason for the hang is complicated and yes, it is related to the code you refer to (guessing the device).
>
> There are two ways to work around it:...
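The reply above is truncated, so the exact workarounds it lists are not visible here. One common way to avoid the device-guessing path in `ProcessGroupNCCL` is to pin each rank to its GPU before the first collective, so NCCL never has to guess. A minimal sketch, assuming a `torchrun`-style launcher that sets `LOCAL_RANK` (the helper name is ours, not from the thread):

```python
import os
import torch
import torch.distributed as dist

def pick_device() -> int:
    # torchrun and similar launchers export LOCAL_RANK per process;
    # default to GPU 0 for single-process runs.
    return int(os.environ.get("LOCAL_RANK", "0"))

def init_distributed() -> None:
    # Pin this process to its GPU *before* init / the first collective,
    # so NCCL does not have to fall back to the "best guess" code path
    # linked above.
    torch.cuda.set_device(pick_device())
    dist.init_process_group(backend="nccl")
```

This only sketches the ordering constraint (set the device first, then initialize the process group); the actual fix for a given hang may depend on the launcher and PyTorch version.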