Ronghang Hu

64 comments by Ronghang Hu

I see. @JackCaoG Thanks for looking into it!
> I think the cast in the backward was a result of the cast in the forward (grad is in bf16 though, I...

And this cast to f32 in the backward pass happens not only in the first iteration but also in subsequent iterations if we try to do forward and backward multiple...
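
For reference, a minimal repro sketch of how one could check this (assuming the setup discussed in this thread: an `nn.Linear` cast to bf16 on an XLA device; the module, shapes, and loop count here are illustrative, not the original repro):

```python
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# .to(torch.bfloat16) goes through nn.Module._apply, i.e. a .data assignment.
model = nn.Linear(8, 8).to(device).to(torch.bfloat16)
x = torch.randn(4, 8, dtype=torch.bfloat16, device=device)

for step in range(3):
    loss = model(x).sum()
    loss.backward()
    # Dump the lazy IR feeding the parameter gradients and look for f32 casts;
    # the point above is that they show up in every iteration, not just the first.
    ir_text = torch_xla._XLAC._get_xla_tensors_text([p.grad for p in model.parameters()])
    print(f"step {step}: IR contains f32: {'f32' in ir_text}")
    xm.mark_step()
```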

> This seems to be why the casting is happening, I am guessing this is the backward of some op?
Yeah, I think this `aten::permute(%5)` is the autograd-generated backward...

> mark_step only affects the pytorch/xla view of how the tensor is stored; it does not affect the autograd engine, which is a layer above pytorch/xla.
@JackCaoG Yeah, I was aware of...
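
A small sketch of that separation (purely illustrative): cutting the lazy graph with `mark_step()` between the forward and the backward does not break autograd, because the grad_fn graph lives at the PyTorch level, above PyTorch/XLA's IR:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
w = torch.randn(4, 4, device=device, requires_grad=True)
x = torch.randn(2, 4, device=device)

y = (x @ w.t()).sum()
xm.mark_step()   # materializes the forward XLA graph; only affects how tensors are stored
y.backward()     # the autograd graph is untouched, so backward simply records new XLA IR
xm.mark_step()   # materializes the backward graph
print(w.grad.shape)
```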

@hjm-aws @JackCaoG @soulitzer OK, I think I figured out the underlying cause of the issue above. It's because `nn.Module` by default casts a parameter to another dtype by directly assigning...
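
A simplified sketch of what that `.data` assignment looks like (roughly what `nn.Module._apply` does for `model.to(torch.bfloat16)`; the explicit loop below is an illustration, not the actual library code):

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)  # parameters are created in float32

# The parameter objects are kept, but their storage is swapped in place via
# .data, so autograd never records a cast op for this dtype change.
for p in lin.parameters():
    with torch.no_grad():
        p.data = p.data.to(torch.bfloat16)

print(lin.weight.dtype)  # torch.bfloat16, with no cast visible to autograd
```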

> I'd point out that metadata can only go out-of-date when p is not a leaf node. When p is a leaf node, the following should take care of updating...

@JackCaoG I see, thanks for the update on this!

Thanks @JackCaoG -- I'll try this out!

A workaround to address both this issue and https://github.com/pytorch/xla/issues/3718 is to add the following snippet before the model definition code (in distributed training, it needs to be added to each...
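
The snippet itself is truncated above, so the sketch below only illustrates *where* such a patch has to run in a distributed setup; `apply_workaround` is a hypothetical placeholder, not the actual snippet. Each process spawned by `xmp.spawn` builds its own copy of the model, so the patch must run inside every process before the model is defined:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def apply_workaround():
    # Placeholder for the snippet referenced in the comment above (truncated here).
    pass


def _mp_fn(index):
    apply_workaround()  # must run in every spawned process, before model definition
    device = xm.xla_device()
    model = torch.nn.Linear(8, 8).to(device)
    # ... build the optimizer and run the training loop ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```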

I guess the behavior difference above could be related to the different implementations of `aten::t` between native PyTorch and PyTorch/XLA; `aten::t` is what `weight.t()` calls in `torch.nn.functional.linear`'s underlying [`aten::linear`...
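
A quick way to see where `aten::t` enters the picture (a hedged sketch; the shapes are arbitrary): `F.linear(x, w)` is semantically `x @ w.t()`, and dumping the lazy IR shows how PyTorch/XLA lowers that transpose (e.g. as a permute), which can then be compared against native PyTorch's decomposition:

```python
import torch
import torch.nn.functional as F
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(2, 3, device=device)
w = torch.randn(4, 3, device=device, requires_grad=True)

out = F.linear(x, w)   # dispatches to aten::linear, i.e. x @ w.t()

# Dump the lazy IR to inspect how the transpose inside linear was lowered.
print(torch_xla._XLAC._get_xla_tensors_text([out]))
```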