Ronghang Hu

64 comments by Ronghang Hu

I see. @JackCaoG Thanks for looking into it!
> I think the cast in the backward was a result of the cast in the forward (grad is in bf16 though, I...

And this cast to f32 in the backward pass happens not only in the first iteration but also in subsequent iterations if we try to do forward and backward multiple...
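
For reference, a minimal repro sketch of how one could check this (assuming the setup discussed in this thread: an `nn.Linear` cast to bf16 on an XLA device; the module, shapes, and loop count here are illustrative, not the original repro):

```python
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# .to(torch.bfloat16) goes through nn.Module._apply, i.e. a .data assignment.
model = nn.Linear(8, 8).to(device).to(torch.bfloat16)
x = torch.randn(4, 8, dtype=torch.bfloat16, device=device)

for step in range(3):
    loss = model(x).sum()
    loss.backward()
    # Dump the lazy IR feeding the parameter gradients and look for f32 casts;
    # the point above is that they show up in every iteration, not just the first.
    ir_text = torch_xla._XLAC._get_xla_tensors_text([p.grad for p in model.parameters()])
    print(f"step {step}: IR contains f32: {'f32' in ir_text}")
    xm.mark_step()
```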

> This seems to be why the casting is happening, I am guessing this is the backward of some op?
Yeah, I think this `aten::permute(%5)` is the autograd-generated backward...

> mark_step only affects the pytorch/xla view of how the tensor is stored; it does not affect the autograd engine, which is a layer above pytorch/xla.
@JackCaoG Yeah, I was aware of...
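
A small sketch of that separation (purely illustrative): cutting the lazy graph with `mark_step()` between the forward and the backward does not break autograd, because the grad_fn graph lives at the PyTorch level, above PyTorch/XLA's IR:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
w = torch.randn(4, 4, device=device, requires_grad=True)
x = torch.randn(2, 4, device=device)

y = (x @ w.t()).sum()
xm.mark_step()   # materializes the forward XLA graph; only affects how tensors are stored
y.backward()     # the autograd graph is untouched, so backward simply records new XLA IR
xm.mark_step()   # materializes the backward graph
print(w.grad.shape)
```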

@hjm-aws @JackCaoG @soulitzer OK, I think I figured out the underlying cause of the issue above. It's because `nn.Module` by default casts a parameter to another dtype by directly assigning...
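
A simplified sketch of what that `.data` assignment looks like (roughly what `nn.Module._apply` does for `model.to(torch.bfloat16)`; the explicit loop below is an illustration, not the actual library code):

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)  # parameters are created in float32

# The parameter objects are kept, but their storage is swapped in place via
# .data, so autograd never records a cast op for this dtype change.
for p in lin.parameters():
    with torch.no_grad():
        p.data = p.data.to(torch.bfloat16)

print(lin.weight.dtype)  # torch.bfloat16, with no cast visible to autograd
```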

> I'd point out that metadata can only go out-of-date when p is not a leaf node. When p is a leaf node, the following should take care of updating...

@JackCaoG I see, thanks for the update on this!

Thanks @JackCaoG -- I'll try this out!

A workaround to address both this issue and https://github.com/pytorch/xla/issues/3718 is to add the following snippet before the model definition code (in distributed training, it needs to be added to each...
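
The snippet itself is truncated above, so the sketch below only illustrates *where* such a patch has to run in a distributed setup; `apply_workaround` is a hypothetical placeholder, not the actual snippet. Each process spawned by `xmp.spawn` builds its own copy of the model, so the patch must run inside every process before the model is defined:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def apply_workaround():
    # Placeholder for the snippet referenced in the comment above (truncated here).
    pass


def _mp_fn(index):
    apply_workaround()  # must run in every spawned process, before model definition
    device = xm.xla_device()
    model = torch.nn.Linear(8, 8).to(device)
    # ... build the optimizer and run the training loop ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```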

I guess the behavior difference above could be related to the different implementations of `aten::t` between native PyTorch and PyTorch/XLA; `aten::t` is what `weight.t()` calls in `torch.nn.functional.linear`'s underlying [`aten::linear`...
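
A quick way to see where `aten::t` enters the picture (a hedged sketch; the shapes are arbitrary): `F.linear(x, w)` is semantically `x @ w.t()`, and dumping the lazy IR shows how PyTorch/XLA lowers that transpose (e.g. as a permute), which can then be compared against native PyTorch's decomposition:

```python
import torch
import torch.nn.functional as F
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(2, 3, device=device)
w = torch.randn(4, 3, device=device, requires_grad=True)

out = F.linear(x, w)   # dispatches to aten::linear, i.e. x @ w.t()

# Dump the lazy IR to inspect how the transpose inside linear was lowered.
print(torch_xla._XLAC._get_xla_tensors_text([out]))
```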