Xilun Wu
Xilun starts to investigate the implementation of DS2 and collect training sets of reasonable size.
The speedup of testing is quite impressive!! Congrats!
> > Thanks! One suggestion for unit testing would be to create a DeviceMesh in FakeMode to reproduce the issue that I had! Or maybe create a DeviceMesh inside of...
re-enable DTensor tests on CPU in #118134
note: this test requires https://github.com/pytorch/pytorch/pull/126924 to land first
You can also try DTensor `local_map`, which is how we enabled FusedRMSNorm in torchtitan (#404); this is the second approach in @yifuwang's comment.
I think this argument currently serves as a placeholder and may be used in the future. What do you think? @lessw2020 @tianyu-l
https://github.com/pytorch/torchtune/blob/main/torchtune/training/checkpointing/_checkpoint_client.py#L344-L346 This all-gathers the optimizer state dict across ranks, which could lead to high memory usage. Is this intended? @pradeepfn @calvinpelletier
I believe [`local_map`](https://pytorch.org/docs/main/distributed.tensor.html#torch.distributed.tensor.experimental.local_map) is a good fit for this case, to implement a custom `clip_grad_norm_` for DTensor. @zijian-hu let me draft a PR based on your sample so that we...
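For context, here is a rough sketch of the arithmetic a custom `clip_grad_norm_` over sharded gradients has to do: each rank computes a squared norm on its local shard, the squared norms are summed across ranks, and every shard is scaled by the same coefficient. This is not the DTensor or `local_map` implementation. Plain Python lists stand in for per-rank local shards, the outer `sum` stands in for an all-reduce, and the names `local_sq_norm` and `clip_sharded_grads` are hypothetical.

```python
import math


def local_sq_norm(shard):
    """Squared L2 norm of one rank's local gradient shard."""
    return sum(g * g for g in shard)


def clip_sharded_grads(shards_per_rank, max_norm, eps=1e-6):
    """Clip gradients that are split across ranks (illustrative only).

    shards_per_rank: list of lists, one inner list per simulated rank.
    Returns (total_norm, clipped_shards).
    """
    # Stand-in for an all-reduce(SUM) over the per-rank squared norms.
    total_sq = sum(local_sq_norm(shard) for shard in shards_per_rank)
    total_norm = math.sqrt(total_sq)

    # Same coefficient on every rank; never scale gradients up.
    coef = min(1.0, max_norm / (total_norm + eps))

    clipped = [[g * coef for g in shard] for shard in shards_per_rank]
    return total_norm, clipped


# Two simulated ranks holding one gradient element each.
total_norm, clipped = clip_sharded_grads([[3.0], [4.0]], max_norm=1.0)
```

The point of the sketch is that the norm has to be computed over all shards before scaling; clipping each shard by its own local norm would give a different result.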