Junjie Wang issues

Results: 8 issues of Junjie Wang

[9/N] Remove ST multiple ops

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #96989 Differential Revision: [D44158327](https://our.internmc.facebook.com/intern/diff/D44158327)

better-engineering

ciflow/trunk

release notes: distributed (sharded)

ciflow/periodic

[10/N] Remove ST init, binary and chunk ops

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #96985 * #96989 Differential Revision: [D44158326](https://our.internmc.facebook.com/intern/diff/D44158326)

better-engineering

ciflow/trunk

release notes: distributed (sharded)

ciflow/periodic

RFC-0029: Add TP User API design RFC to pytorch rfcs

cla signed

Remove ST from torchsnapshot

As part of the ShardedTensor deprecation, we are starting to clean up its use in torchsnapshot. This is the first PR in a series, and we want to get feedback...

CLA Signed

[405B] Add performance data for 405B model

In this PR, we mostly measured the performance and loss curves for the 405B model with some optimization techniques we recently developed. We also want to log the actual peak TFLOPs...

CLA Signed

[c10d FR analyzer] Output a meaningful debug report for users

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #134528 * #134383 - This PR generates a more useful output log for users: P1552399180. - It also fixes the logic when...

oncall: distributed

topic: not user facing

suppress-bc-linter

Add a ViT Encoder to TorchTitan

This is the first step toward including more models in torchtitan to demonstrate the composability of pretraining. Llama 3.2 is coming, and it is already available in torchtune. We...

CLA Signed

[c10d] Enable CudaEventCache by default and add multi device support

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #140975 We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. It is a feature that tries to reuse CudaEvent so that we don't call destroy...

oncall: distributed

ciflow/trunk

release notes: distributed (c10d)