Xilun Wu comments

Results: 12 comments of Xilun Wu

Apps and benchmarks

Xilun starts to investigate the implementation of DS2 and collect training sets of reasonable size.

[dtensor] enable op db tests by using multithreaded test case

The speedup of testing is quite impressive!! Congrats!

[dtensor][8/N] switch DeviceMesh to use numpy array for devices

> > Thanks! One suggestion for unit testing would be to create a DeviceMesh in FakeMode to reproduce the issue that I had! Or maybe create a DeviceMesh inside of...

Test files not run in CI from pytorch/pytorch

re-enable DTensor tests on CPU in #118134

enable TritonFusedRMSNorm with local_map annotation

Note: this test requires https://github.com/pytorch/pytorch/pull/126924 to land.

Question about custom cuda operators for tensor parallelism

You can also try DTensor `local_map`, which is how we enabled FusedRMSNorm in torchtitan (#404). This is the second approach in @yifuwang's comment.

`max_batch_size` argument in `ModelArgs`

I think this argument currently serves as a placeholder and may be used in the future. What do you think? @lessw2020 @tianyu-l

`max_batch_size` argument in `ModelArgs`

#585

Error while full finetuning Llama 4 Scout

https://github.com/pytorch/torchtune/blob/main/torchtune/training/checkpointing/_checkpoint_client.py#L344-L346 This will all-gather the optim state dict on ranks, which could lead to high memory usage. Is this desired? @pradeepfn @calvinpelletier

Gradient norm clipping with pipeline parallelism (PP)

I believe [`local_map`](https://pytorch.org/docs/main/distributed.tensor.html#torch.distributed.tensor.experimental.local_map) is a good fit for this case, to implement a custom `clip_grad_norm_` for DTensor. @zijian-hu let me draft a PR based on your sample so that we...
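For readers unfamiliar with why clipping needs care under PP, here is a minimal plain-Python sketch of the arithmetic only. It is not the PR mentioned above and does not use `local_map` or DTensor. It assumes each pipeline stage holds a disjoint slice of the gradients and that an all-reduce (SUM) across stages is available; a plain `sum` stands in for that collective here. The names `local_sq_sum` and `clip_grad_norm_across_stages` are hypothetical.

```python
import math
from typing import List


def local_sq_sum(grads: List[float]) -> float:
    """Sum of squared gradient entries held by a single stage."""
    return sum(g * g for g in grads)


def clip_grad_norm_across_stages(
    stage_grads: List[List[float]], max_norm: float, eps: float = 1e-6
) -> float:
    """Clip gradients in place using the L2 norm over ALL stages.

    Returns the total (pre-clipping) gradient norm.
    """
    # Step 1: every stage reduces its own gradients locally.
    partial_sums = [local_sq_sum(grads) for grads in stage_grads]

    # Step 2: combine the partial sums. In a real distributed run this
    # would be an all-reduce (SUM) across stages (assumption); here it
    # is an ordinary Python sum over the list.
    total_norm = math.sqrt(sum(partial_sums))

    # Step 3: every stage applies the SAME coefficient, so the relative
    # scale of gradients across stages is preserved.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for grads in stage_grads:
        for i in range(len(grads)):
            grads[i] *= clip_coef

    return total_norm


# Two hypothetical stages, each holding one gradient value.
stages = [[3.0], [4.0]]
norm_before = clip_grad_norm_across_stages(stages, max_norm=1.0)
```

If each stage clipped against only its own local norm, the stages would scale their gradients by different factors. Combining the squared sums first is what keeps the result equivalent to clipping on a single device.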