Brian Hirsh comments

Results: 150 comments by Brian Hirsh

functionalize storage resizing, minimal ppFSDP traceable forward

@ezyang I updated the PR with more details + some per-file changes. LMK if more detail would be helpful. I also fixed some CI failures and addressed some of the...

functionalize storage resizing, minimal ppFSDP traceable forward

The tests in `test_distributed_patterns.py` now test with `RESIZE=True`. Updates: (1) Killed the `_nn_bind_parameter` prim op and changed the code in dynamo (the custom `autograd.Function`) to call `set_()` instead. The `resize_()` rule...

functionalize storage resizing, minimal ppFSDP traceable forward

Actually, the mentioned tests pass, but I don't think the graph looks right. There is a `storage_resize(0)` in the graph, but it looks like we are not resizing the...

functionalize storage resizing, minimal ppFSDP traceable forward

Hmm, I think I know why we are calling `resize_(0)` on the wrong thing: (1) In the original code, we first call `x.set_(y)`, followed by `x.untyped_storage().resize_(0)`. Importantly, since the `set_()`...
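
For readers unfamiliar with the pattern, the two calls named in (1) can be reproduced in eager mode. This is only a minimal sketch: the tensors `x` and `y` below are made-up stand-ins rather than the parameters from the PR, and it shows standard eager PyTorch behavior, not the traced graph under discussion.

```python
import torch

# Hypothetical stand-ins: x plays the role of the parameter, y the buffer it is bound to.
x = torch.empty(0)
y = torch.ones(4)

# After set_(), x uses y's storage (and takes on y's size and strides).
x.set_(y)
shares_storage = x.untyped_storage().data_ptr() == y.untyped_storage().data_ptr()

# Resizing the storage through x frees the bytes that y also points at.
x.untyped_storage().resize_(0)
y_nbytes_after = y.untyped_storage().nbytes()

# The tensor metadata is unchanged; only the backing storage is now empty.
x_shape_after = tuple(x.shape)
```

Because both tensors alias one storage after `set_()`, a resize issued through either of them is visible through the other, so a graph has to apply the resize to the right storage.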

functionalize storage resizing, minimal ppFSDP traceable forward

Updated, the forward graph now properly resizes the parameter before saving it for backward. Here is a paste of the ATen graph, generated with `TORCH_LOGS="aot" python test/inductor/test_distributed_patterns.py -k test_fake_distributed_inductor`: https://www.internalfb.com/phabricator/paste/view/P1202286465...

functionalize storage resizing, minimal ppFSDP traceable forward

Updated the PR to effectively change inductor's `bind_nn_parameter` lowering to lower `set_` instead, so inductor properly marks mutation/aliasing for that lowering instead of using the fallback. I also kept `can_detach()`...

functionalize storage resizing, minimal ppFSDP traceable forward

Going to continue babysitting CI, but I think I fixed the main issues: (1) fake tensor caching (2) missing header declaration that I couldn't repro locally (it was because CI...

functionalize storage resizing, minimal ppFSDP traceable forward

The latest FakeTensor caching test failures in this PR are blocked on https://github.com/pytorch/pytorch/pull/123880. That PR fixes a latent bug in FakeTensor that my PR hits because I'm now including storage nbytes...

functionalize storage resizing, minimal ppFSDP traceable forward

I think https://github.com/pytorch/pytorch/pull/123880 is the only thing left for this PR to be ready, so to confirm, I rebased it below this PR to sanity check the state of...

functionalize storage resizing, minimal ppFSDP traceable forward

Updated. After ignoring the failures that will be fixed by https://github.com/pytorch/pytorch/pull/123880, there were a few more failures to look into. The most interesting one is that some new dynamo tests...