Natalia Gimelshein

Results: 214 comments of Natalia Gimelshein

@d4l3k Using more memory than the math predicts is unexpected (and, tbh, torch.compile being faster than flash attention is unexpected too). Is it possible to look at the full model?

@d4l3k can you please open an issue?

Does inductor use cudagraphs? The profiler and cudagraphs don't work together; it's an old and still-unfixed issue, see e.g. https://github.com/pytorch/torchdynamo/issues/1413

We have an open (and high-priority) issue for this, #75504. It has a suggested workaround that isn't workable in most cases, and no other activity.

Pointwise ops are pretty reliably optimized by inductor, and new fused variants won't be added to pytorch core.

The high-water mark of memory is reached toward the end of the forward pass and is determined by what needs to be saved for the backward pass. Making some intermediate operations...
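As a toy illustration of that memory profile (plain Python with hypothetical names, not PyTorch internals): if every layer's activation is saved for backward, live memory grows through the forward pass and peaks at its end, then shrinks as backward consumes the saved activations.

```python
# Toy model (not PyTorch internals): track live memory through a
# forward/backward pass where each layer's activation is saved for backward.
def peak_memory(activation_sizes):
    """Return (peak, timeline) of live memory.

    Live memory grows monotonically through forward (activations are
    kept for backward) and peaks at the end of the forward pass.
    """
    live = 0
    timeline = []
    for size in activation_sizes:
        live += size          # activation saved for backward
        timeline.append(live)
    peak = live               # high-water mark: end of forward
    # backward walks the layers in reverse, freeing each saved activation
    for size in reversed(activation_sizes):
        live -= size
        timeline.append(live)
    return peak, timeline

peak, timeline = peak_memory([4, 8, 2, 6])
# peak == 20: the sum of everything saved for backward
```

In this toy accounting, shrinking or recomputing any saved intermediate directly lowers the peak, which is why recomputation/checkpointing targets the forward-pass saves.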

Running the full suites, but I think we can just wait for the dashboard to update.

Alternatively, could we improve AsStridedBackward for the cases where we can easily see that all of the original elements participated, so there's no need to pre-fill with 0s before copying the grad?
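A plain-Python sketch of that idea for the 1-D case (a hypothetical helper, not the real AsStridedBackward, which handles N dimensions and overlapping strides): when the strided view covers every input element exactly once, the input gradient is just a permutation of `grad`, so the general path's zero pre-fill plus accumulation is unnecessary.

```python
# Sketch only: 1-D as_strided backward with a fast path for exact coverage.
def as_strided_backward(grad, input_len, stride, offset=0):
    # Input indices the strided view read from (1-D case for simplicity).
    idx = [offset + i * stride for i in range(len(grad))]
    if sorted(idx) == list(range(input_len)):
        # Fast path: every input element participated exactly once, so the
        # grad maps one-to-one onto the input -- an inverse permutation,
        # with no need to pre-fill with zeros or accumulate.
        inv = [None] * input_len
        for pos, j in enumerate(idx):
            inv[j] = grad[pos]
        return inv
    # General path: pre-fill with zeros, then accumulate (handles elements
    # the view never touched, and overlapping reads that must sum).
    out = [0.0] * input_len
    for pos, j in enumerate(idx):
        out[j] += grad[pos]
    return out

as_strided_backward([1, 2, 3], 3, 1)  # fast path: [1, 2, 3]
as_strided_backward([1, 2, 3], 5, 2)  # general path: zeros where untouched
```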

So this is all looking good, but I wonder what will happen to optimizers once this is turned on. Optimizers currently do in-place updates, which inductor handles, and with this...