Natalia Gimelshein

Results: 214 comments of Natalia Gimelshein

I think this line is the problem: https://github.com/pytorch/pytorch/pull/89485/files#diff-b5faaeef4cddee9a195a6ca3c652be163f38d4fc1b31d0b42ed5944cb41ab67fR138. @xuzhao9, can you try reverting either that PR or just that line and see if it fixes the problem?

Also, this change https://github.com/pytorch/pytorch/pull/89485/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8cL1171 doesn't work for GPU.

Yeah, the author modified the CPU implementation only, but made changes to the common path, so now the GPU is getting discontiguous gradients where previously it was guaranteed to get contiguous ones...
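If it helps to confirm what layout backward actually hands to the kernel, a tensor hook shows the incoming gradient before it is accumulated; a minimal sketch with a placeholder op (not the op touched by the PR):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device, requires_grad=True)

# The hook receives the gradient exactly as the autograd engine produced it,
# before accumulation into x.grad, so contiguity can be checked here.
x.register_hook(lambda grad: print("grad contiguous:", grad.is_contiguous()))

y = x.t()           # placeholder op; substitute the path changed by the PR
y.sum().backward()
```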

`where.Scalar` is a CompositeImplicitAutograd function https://github.com/pytorch/pytorch/blob/a6ac922eabee8fce7a48dedac81e82ac8cfe9a45/aten/src/ATen/native/native_functions.yaml#L5997, so it's traced to the Tensor overload.

@bdhirsh so the right way for the `where.Scalar` overload would be to set `wrapped_number=True`? I think it's just an oversight that it doesn't, and that would be the correct fix.

Ah I see, yeah, for `where` we do manual type promotion instead of letting TensorIterator handle it, so that means that wrapped numbers can end up being neither Long...
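For context, a small illustration of how wrapped Python scalars behave under the generic promotion rules (this is `torch.result_type` behavior, not the manual promotion inside `where`):

```python
import torch

# A Python scalar is a "wrapped number": it participates in type promotion
# at lower priority than real tensors.
print(torch.result_type(torch.ones(3, dtype=torch.float16), 2.5))  # torch.float16
print(torch.result_type(torch.ones(3, dtype=torch.long), 2.5))     # torch.float32 (default dtype)

# where with a scalar operand: the scalar should be promoted against the tensor.
cond = torch.tensor([True, False, True])
out = torch.where(cond, torch.ones(3, dtype=torch.long), 2.5)
print(out.dtype)  # expected torch.float32 under these rules
```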

@bdhirsh it would be great to fix `where`; the problem with wrapped numbers is that I don't know of a way to move them to the device in a non-synchronizing way (probably...
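For reference, the usual non-blocking host-to-device pattern stages through pinned memory, which doesn't obviously apply to a wrapped number created inside an op; a rough sketch:

```python
import torch

# Sketch only: an async H2D copy needs the source in pinned (page-locked)
# memory; a copy from pageable host memory still blocks the host.
scalar = torch.tensor(2.5)                     # 0-dim CPU tensor, similar to a wrapped number
pinned = scalar.pin_memory()                   # page-locked staging copy
on_gpu = pinned.to("cuda", non_blocking=True)  # asynchronous host-to-device copy
```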

Can you please add a bc-breaking note here?

No, `set_to_none=True` decreases memory usage, as it frees gradient memory when called and doesn't allocate it again until the gradients are computed (which will likely be after the high memory watermark is...
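For anyone following along, the call in question in a standard training loop (the model and optimizer here are just placeholders):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    # Frees the .grad tensors instead of overwriting them with zeros; the
    # memory is only reallocated once backward() produces new gradients.
    opt.zero_grad(set_to_none=True)
    loss = model(torch.randn(32, 1024, device=device)).sum()
    loss.backward()
    opt.step()
```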

So flash launches more kernels, but the traces above show the CPU side, not the actual CUDA execution. Can you share the raw traces?
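To capture the device-side picture rather than just CPU-side launches, `torch.profiler` can record CUDA activity and export a chrome trace; a minimal sketch (assumes a CUDA build, and the SDPA call is only a stand-in for the actual workload):

```python
import torch
from torch.profiler import profile, ProfilerActivity

q = k = v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()

# The chrome trace includes the actual kernel timeline on the GPU streams.
prof.export_chrome_trace("sdpa_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```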