Natalia Gimelshein

Results: 214 comments of Natalia Gimelshein

I think this line is the problem: https://github.com/pytorch/pytorch/pull/89485/files#diff-b5faaeef4cddee9a195a6ca3c652be163f38d4fc1b31d0b42ed5944cb41ab67fR138. @xuzhao9, can you try reverting either that PR or just that line and see if it fixes the problem?

Also, this change https://github.com/pytorch/pytorch/pull/89485/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8cL1171 doesn't work for GPU.

Yeah, the author modified the CPU implementation only, but made changes to the common path, so now the GPU is getting discontiguous gradients where previously it was guaranteed to get contiguous ones...
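If it helps to confirm what layout backward actually hands to the kernel, a tensor hook shows the incoming gradient before it is accumulated; a minimal sketch with a placeholder op (not the op touched by the PR):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device, requires_grad=True)

# The hook receives the gradient exactly as the autograd engine produced it,
# before accumulation into x.grad, so contiguity can be checked here.
x.register_hook(lambda grad: print("grad contiguous:", grad.is_contiguous()))

y = x.t()           # placeholder op; substitute the path changed by the PR
y.sum().backward()
```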

`where.Scalar` is a CompositeImplicitAutograd function https://github.com/pytorch/pytorch/blob/a6ac922eabee8fce7a48dedac81e82ac8cfe9a45/aten/src/ATen/native/native_functions.yaml#L5997, so it's traced to the Tensor overload.

@bdhirsh so the right way for the `where.Scalar` overload would be to set `wrapped_number=True`? I think it's just an oversight that it doesn't, and that would be the correct fix.

Ah I see, yeah, for `where` we do manual type promotion instead of letting TensorIterator handle it, so that means that wrapped numbers can end up being neither Long...
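For context, a small illustration of how wrapped Python scalars behave under the generic promotion rules (this is `torch.result_type` behavior, not the manual promotion inside `where`):

```python
import torch

# A Python scalar is a "wrapped number": it participates in type promotion
# at lower priority than real tensors.
print(torch.result_type(torch.ones(3, dtype=torch.float16), 2.5))  # torch.float16
print(torch.result_type(torch.ones(3, dtype=torch.long), 2.5))     # torch.float32 (default dtype)

# where with a scalar operand: the scalar should be promoted against the tensor.
cond = torch.tensor([True, False, True])
out = torch.where(cond, torch.ones(3, dtype=torch.long), 2.5)
print(out.dtype)  # expected torch.float32 under these rules
```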

@bdhirsh it would be great to fix `where`; the problem with wrapped numbers is that I don't know of a way to move them to the device in a non-synchronizing way (probably...
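For reference, the usual non-blocking host-to-device pattern stages through pinned memory, which doesn't obviously apply to a wrapped number created inside an op; a rough sketch:

```python
import torch

# Sketch only: an async H2D copy needs the source in pinned (page-locked)
# memory; a copy from pageable host memory still blocks the host.
scalar = torch.tensor(2.5)                     # 0-dim CPU tensor, similar to a wrapped number
pinned = scalar.pin_memory()                   # page-locked staging copy
on_gpu = pinned.to("cuda", non_blocking=True)  # asynchronous host-to-device copy
```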

Can you please add a bc-breaking note here?

No, `set_to_none=True` decreases memory usage, as it frees gradient memory when called and doesn't allocate it again until the gradients are computed (which will likely be after the high memory watermark is...
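For anyone following along, the call in question in a standard training loop (the model and optimizer here are just placeholders):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    # Frees the .grad tensors instead of overwriting them with zeros; the
    # memory is only reallocated once backward() produces new gradients.
    opt.zero_grad(set_to_none=True)
    loss = model(torch.randn(32, 1024, device=device)).sum()
    loss.backward()
    opt.step()
```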

So flash launches more kernels, but the traces above show the CPU side, not the actual CUDA execution. Can you share the raw traces?
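To capture the device-side picture rather than just CPU-side launches, `torch.profiler` can record CUDA activity and export a chrome trace; a minimal sketch (assumes a CUDA build, and the SDPA call is only a stand-in for the actual workload):

```python
import torch
from torch.profiler import profile, ProfilerActivity

q = k = v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()

# The chrome trace includes the actual kernel timeline on the GPU streams.
prof.export_chrome_trace("sdpa_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```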