Andrew Gu
@qsh-zh Thanks for your interest in FSDP2! Your concerns make a lot of sense to me. **API Stability** IIUC, PyTorch has a feature classification of prototype -> beta -> stable....
I updated the issue tracker this morning after seeing your comment. There are still a few things that may need a bit more validation, but the main items are all...
Yes, FSDP2 should address this. The memory usage is deterministic.
I am out for a week, so I cannot give a detailed response, but you can look at the recordStream part of the RFC linked at the bottom of this original...
It might be nice to fold this into the `train_context`.
cc: @tianyu-l for thoughts on how to handle this, perhaps with separate forward and backward contexts.
I think Rohan added a tentative mixed precision API for DDP, but it never became a public feature. I think using AMP is probably the way to go.
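For reference, a minimal sketch of the AMP route mentioned above, using `torch.autocast`. This is an illustrative example, not code from the thread; it uses CPU + bfloat16 so it runs without a GPU, but the same pattern applies with `device_type="cuda"`.

```python
import torch

# Toy model and input; parameters stay in float32.
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

# Autocast runs eligible ops (e.g. the linear's matmul) in bfloat16
# while leaving the master weights in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
```

For CUDA training one would typically pair this with `torch.amp.GradScaler` when using float16; bfloat16 usually does not need loss scaling.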
@tianyu-l I think it's also acceptable for now to allow the `norm` to be assigned to the root module. In other words, just wrap `tok_embeddings` separately and `output` separately.
cc: @weifengpy @mori360
@mingdianliu Which version of PyTorch are you using? You may need a newer version.