Horace He

242 comments by Horace He

So interestingly, it's only if `` is the only type of action performed. `iabc.` works as expected.

@rnewman That's a separate bug, one that I can't replicate at least. Could you open up a new issue, and post any errors you might have in your dev console?...

@ad8e imo, I would either use 6 or 12. MFU was originally intended to *exclude* recomputation flops (from activation checkpointing), so it seems somewhat strange to me to reinclude them here....
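A minimal sketch of the distinction being drawn, assuming the common 6*N flops-per-token approximation (2*N forward + 4*N backward); the function names and numbers below are illustrative, not from this thread:

```
# Illustrative only: MFU counts model flops (excluding recomputation), while a
# "hardware" utilization number would also count the extra forward pass from
# activation checkpointing (~8*N flops per token instead of 6*N).
def mfu(n_params, tokens_per_s, peak_flops_per_s):
    model_flops_per_s = 6 * n_params * tokens_per_s
    return model_flops_per_s / peak_flops_per_s

def hfu(n_params, tokens_per_s, peak_flops_per_s):
    hardware_flops_per_s = 8 * n_params * tokens_per_s
    return hardware_flops_per_s / peak_flops_per_s

# e.g. a 1.3B-parameter model at 10k tokens/s on a 312 TFLOP/s GPU:
print(mfu(1.3e9, 1e4, 312e12))  # ~0.25
print(hfu(1.3e9, 1e4, 312e12))  # ~0.33
```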

If I'm understanding the benchmarks correctly, it seems like we shouldn't enable it right now? Since even for fp32 on cuda, perf is quite a bit worse than the eager...

I think I'll probably add the cache back, since it seems like there's a couple more folks relying on the old behavior than I thought, and many people don't want...

Actually, we plan on moving Dynamo into PyTorch core sooner than expected (this week). When that's done, would it also be possible to replace the call to `memory_efficient_fusion` with `dynamo.optimize`?...
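A minimal sketch of the suggested swap, assuming the current code wraps some function with functorch's `memory_efficient_fusion`; the `"inductor"` backend string and the `fn` below are illustrative, not from the thread:

```
import torch
import torch._dynamo as dynamo

def fn(x):
    # stand-in for whatever function is currently passed to memory_efficient_fusion
    return torch.nn.functional.gelu(x) * x

# before: from functorch.compile import memory_efficient_fusion
#         opt_fn = memory_efficient_fusion(fn)
opt_fn = dynamo.optimize("inductor")(fn)
opt_fn(torch.randn(8, 8))
```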

@ngimel My PR doesn't fix this issue, just the underlying `guard_multiple_of` issue.

@vadimkantorov I think torch.compile does something better than producing in-place code in your cases.

@vadimkantorov For dropout + relu, we do this (run this with `AOT_FX_GRAPHS=1`):

```
import torch
import torch.nn.functional as F

@torch.compile
def f(x):
    return F.dropout(x, 0.5, training=True).relu()

f(torch.randn(20, 20, requires_grad=True, device='cuda'))
```

```
====== Forward graph 0...
```