tianyu-l issues

Results: 20 issues of tianyu-l

numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used

A higher loss (9.5602 vs. 9.3164) was observed in the dtensor case after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, and the complex number...

bug

`freqs_cis` in llama model should be a non-persistent buffer

Currently it is registered as a persistent buffer for two reasons, copied from https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py#L355: `# TODO persistent should be set to false, since this buffer can be recomputed.`...

bug

only produce tensorboard logs on rank 0 by default

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #339

1. For tensorboard metrics, we mostly care about loss, memory, wps/mfu. Loss is all-reduced so will be the same on all...

CLA Signed
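The rank-0 default described in the tensorboard entry above can be sketched as a small gate. This is only an illustration, not the code from the PR: the helper names `current_rank` and `should_write_tensorboard` and the `rank_0_only` flag are hypothetical, and the sketch assumes the process was launched by a launcher such as `torchrun`, which exports the `RANK` environment variable.

```python
import os


def current_rank(env=None) -> int:
    """Read the global rank from the environment (hypothetical helper).

    Falls back to 0 when RANK is unset, e.g. in a single-process run.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def should_write_tensorboard(rank: int, rank_0_only: bool = True) -> bool:
    """Return True if this rank should emit tensorboard logs.

    With rank_0_only=True (the assumed default), only rank 0 writes;
    setting it to False lets every rank write its own logs.
    """
    return rank == 0 or not rank_0_only


if __name__ == "__main__":
    rank = current_rank()
    if should_write_tensorboard(rank):
        print(f"rank {rank}: writing tensorboard logs")
    else:
        print(f"rank {rank}: skipping tensorboard logs")
```

Keeping an opt-out flag is one possible design: metrics that are all-reduced are identical across ranks, but per-rank values such as memory may still be worth logging when debugging.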

[dtensor] remove `output_` prefix from OpStrategy properties

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #126359

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @wconstab @yf225 @chauhang...

oncall: distributed

ciflow/trunk

ciflow/inductor

release notes: distributed (dtensor)

[BUG] cross-entropy loss not computed correctly when label_smoothing is enabled

**Describe the bug**

Currently, when `label_smoothing` is enabled, `mean_log_probs` is computed as a local mean ([code pointer](https://github.com/NVIDIA/Megatron-LM/blob/a5415fcfacef2a37416259bd38b7c4b673583675/megatron/core/tensor_parallel/cross_entropy.py#L87)). This is not the expected behavior for label smoothing, and can cause the...

stale
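To illustrate why a local mean is a problem in the label-smoothing entry above, here is a minimal pure-Python sketch. It is not Megatron-LM code: it assumes a hypothetical layout in which the vocabulary dimension is split evenly across two tensor-parallel ranks, and it simulates the all-reduce with a plain sum. Label smoothing spreads probability mass over the whole vocabulary, so the mean of the log-probabilities should presumably be taken over all classes rather than over one rank's shard.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mean(xs):
    return sum(xs) / len(xs)


# Toy logits for one token over a vocabulary of 8 classes (made-up values).
logits = [2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 3.0, -2.0]
log_probs = log_softmax(logits)

# Hypothetical tensor-parallel layout: vocabulary split evenly over 2 ranks.
num_ranks = 2
shard_size = len(log_probs) // num_ranks
shards = [log_probs[r * shard_size:(r + 1) * shard_size] for r in range(num_ranks)]

# What each rank gets if it only averages over its own shard.
local_means = [mean(s) for s in shards]

# Mean over the full vocabulary: sum the per-rank sums (standing in for an
# all-reduce) and divide by the full vocabulary size.
global_mean = sum(sum(s) for s in shards) / len(log_probs)

if __name__ == "__main__":
    for r, lm in enumerate(local_means):
        print(f"rank {r}: local mean = {lm:.4f}")
    print(f"full-vocabulary mean = {global_mean:.4f}")
```

In this toy setup each rank's local mean differs from the full-vocabulary mean, so any loss term built from the local value would differ from rank to rank.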

remove PP tracer

add contributing guidelines

2D whole model compile fails at embedding layer

[DO NOT MERGE][example] fold batch and sequence dimensions to accelerate Sequence Parallel

benchmark perf numbers on H100 GPUs and update performance.md