tianyu-l

Showing 20 issues authored by tianyu-l

Higher loss (9.5602 vs. 9.3164) was observed for the dtensor case after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, and the complex number...

bug

Currently it is registered as a persistent buffer, for two reasons (copied from https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py#L355): ``` # TODO persistent should be set to false, since this buffer can be recomputed.... ```

bug
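
The TODO in this snippet refers to `torch.nn.Module.register_buffer`, which takes a `persistent` flag. Below is a minimal sketch of the non-persistent alternative it describes (toy module and dimensions are made up here; this is not torchtitan's actual model code):

```python
import torch
import torch.nn as nn


def precompute_freqs_cis(dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # Complex rotary-embedding frequencies; cheap to recompute, so they need not be checkpointed.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len).float()
    return torch.polar(torch.ones(max_seq_len, dim // 2), torch.outer(t, freqs))


class ToyTransformer(nn.Module):
    def __init__(self, dim: int = 64, max_seq_len: int = 128):
        super().__init__()
        # persistent=False keeps freqs_cis out of state_dict(); it is rebuilt in
        # __init__ (or an init_weights hook) instead of being loaded from a checkpoint.
        self.register_buffer(
            "freqs_cis", precompute_freqs_cis(dim, max_seq_len), persistent=False
        )


model = ToyTransformer()
assert "freqs_cis" not in model.state_dict()  # excluded from saved checkpoints
```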

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #339 1. For tensorboard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it will be the same on all...

CLA Signed
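
A minimal sketch of the metric handling this snippet alludes to (the `log_loss` helper and `dp_group` argument are placeholders for illustration, not torchtitan's API): once the loss is averaged across data-parallel ranks it is identical everywhere, so writing it to TensorBoard from rank 0 alone loses nothing.

```python
import torch
import torch.distributed as dist


def log_loss(loss: torch.Tensor, writer, step: int, dp_group=None) -> None:
    """Average the per-rank loss and log it once, from rank 0."""
    global_loss = loss.detach().clone()
    if dist.is_initialized():
        # After the all-reduce, every rank holds the same averaged value.
        # ReduceOp.AVG needs a backend that supports it (e.g. NCCL).
        dist.all_reduce(global_loss, op=dist.ReduceOp.AVG, group=dp_group)
    if not dist.is_initialized() or dist.get_rank() == 0:
        # Other ranks would only write duplicates of the same number.
        writer.add_scalar("loss/global_avg", global_loss.item(), step)
```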

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #126359 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @wconstab @yf225 @chauhang...

oncall: distributed
ciflow/trunk
ciflow/inductor
release notes: distributed (dtensor)

**Describe the bug** Currently, when `label_smoothing` is enabled, `mean_log_probs` is computed as a local mean ([code pointer](https://github.com/NVIDIA/Megatron-LM/blob/a5415fcfacef2a37416259bd38b7c4b673583675/megatron/core/tensor_parallel/cross_entropy.py#L87)). This is not the expected behavior for label smoothing, and can cause the...

stale
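
To see why the local mean is the wrong quantity (a standalone numerical sketch, not Megatron-LM code; the tensor-parallel vocab split is simulated with `torch.chunk`): label smoothing needs the mean of log-probs over the *full* vocabulary, which can be recovered by summing the local shard sums and dividing by the full vocab size, but not by averaging within a shard.

```python
import torch

torch.manual_seed(0)
vocab_size, tp_size = 12, 4
log_probs = torch.log_softmax(torch.randn(vocab_size), dim=-1)

# What label smoothing expects: the mean over the full vocabulary.
global_mean = log_probs.mean()

# Simulated vocab-parallel shards, one per tensor-parallel rank.
shards = torch.chunk(log_probs, tp_size)

# A purely local mean differs from rank to rank and from the global mean.
local_means = torch.stack([s.mean() for s in shards])
print(local_means.tolist(), global_mean.item())

# Correct recovery: all-reduce (here, a plain sum) the local sums, then divide
# by the full vocab size, so every rank ends up with the true global mean.
recovered = torch.stack([s.sum() for s in shards]).sum() / vocab_size
assert torch.allclose(recovered, global_mean)
```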

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #555 Discussed with @wconstab and @kwen2501, it seems the PP tracer has two limitations right now: 1. It doesn't support `init_weights`, thus...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #525 As titled. Hope these guidelines help clarify what and how to contribute to torchtitan, and make the repo more self-service....

CLA Signed

Specifically, it fails when handling the DTensor `MaskPartial` placement of the sharded embedding. This only happens when we do whole-model compile. TransformerBlock-level compilation (the default) + separately compiling the embedding layer...

bug
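
A hedged sketch of the block-level compilation this snippet contrasts with whole-model compile (the `layers` / `tok_embeddings` attribute names follow a llama-style module layout and are assumptions here, not a fixed API):

```python
import torch
import torch.nn as nn


def compile_per_block(model: nn.Module) -> nn.Module:
    """Compile each TransformerBlock and the embedding separately, instead of
    wrapping the whole model in a single torch.compile call."""
    for name, block in list(model.layers.named_children()):
        # Replace each block with its compiled wrapper; any graph breaks stay
        # local to one block instead of spanning the whole model.
        model.layers.register_module(name, torch.compile(block))
    model.tok_embeddings = torch.compile(model.tok_embeddings)
    return model
```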

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #437 Note: This PR is for showcasing purposes only and is almost a reverse of #190. At the cost of model code...

CLA Signed

Currently we only have perf numbers on A100 GPUs.

documentation