tianyu-l
Numerical difference in SDPA between non-DTensor and DTensor when math attention and fp16 are used
Higher loss (9.5602 vs. 9.3164) was observed for the dtensor case, after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, and the complex number...
Currently it is registered as a persistent buffer for two reasons, copied from https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py#L355 ``` # TODO persistent should be set to false, since this buffer can be recomputed....
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #339 1. For tensorboard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it will be the same on all...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #126359 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @wconstab @yf225 @chauhang...
**Describe the bug** Currently, when `label_smoothing` is enabled, `mean_log_probs` is computed as a local mean ([code pointer](https://github.com/NVIDIA/Megatron-LM/blob/a5415fcfacef2a37416259bd38b7c4b673583675/megatron/core/tensor_parallel/cross_entropy.py#L87)). This is not the expected behavior for label smoothing, and can cause the...
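A minimal sketch of why a local mean differs from the intended one, assuming the vocabulary dimension is split evenly across tensor-parallel ranks. The toy logits, the `shard` helper, and the two-rank setup are hypothetical and only simulate the sharding in plain Python; this is not Megatron-LM code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over the full vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def shard(values, world_size):
    """Hypothetical helper: split a list evenly into one slice per rank."""
    size = len(values) // world_size
    return [values[r * size:(r + 1) * size] for r in range(world_size)]

# Toy logits for a single token over a vocabulary of 8 entries.
logits = [2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 0.25, -2.0]
world_size = 2  # simulated tensor-parallel ranks

log_probs = log_softmax(logits)

# Reference: mean log-probability over the whole vocabulary.
global_mean = sum(log_probs) / len(log_probs)

# What each simulated rank gets if it averages only its own vocabulary slice.
local_means = [sum(s) / len(s) for s in shard(log_probs, world_size)]

# Summing the per-rank partial sums (as an all-reduce would) and dividing
# by the full vocabulary size recovers the reference value.
partial_sums = [sum(s) for s in shard(log_probs, world_size)]
reduced_mean = sum(partial_sums) / len(log_probs)

print("global mean :", global_mean)
print("local means :", local_means)
print("reduced mean:", reduced_mean)
```

In this toy setup each rank's local mean differs from the full-vocabulary mean, while summing the partial sums across ranks before dividing matches it.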
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #555 After discussing with @wconstab and @kwen2501, it seems the PP tracer has two limitations right now: 1. It doesn't support `init_weights`, thus...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #525 As titled. Hopefully these guidelines help clarify what and how to contribute to torchtitan, and make the repo more self-service....
Specifically, it failed to handle the DTensor `MaskPartial` placement of the sharded embedding. This only happens when we do whole-model compile. TransformerBlock-level compilation (default) + separately compiling the embedding layer...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #437 Note: This PR is for showcasing purposes only and is almost a reverse of #190. At the cost of model code...
Currently we only have perf numbers on A100 GPUs.