tianyu-l issues

Results: 20 issues of tianyu-l

[dtensor] add support for loss parallel

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #119877 Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input...

oncall: distributed

ciflow/trunk

ciflow/inductor

release notes: distributed (dtensor)
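To make "distributed cross entropy" concrete, here is a minimal single-process sketch. It assumes the logits are sharded along the class dimension; the snippet above is truncated, so that layout is an assumption rather than something the PR text confirms. The function names are hypothetical, and the plain Python `max` and `sum` over shards stand in for the collectives a real implementation would use.

```python
import math


def sharded_cross_entropy(logit_shards, target):
    """Cross entropy for one sample whose logits are split into class shards.

    Each inner list stands in for the slice held by one rank (an assumption
    made for illustration). Reductions over shards mimic all-reduce calls.
    """
    # Step 1: every shard finds its local max; the global max is their max.
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: every shard sums its shifted exponentials; add the partial sums.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )

    # Step 3: only the shard that owns the target class supplies its logit.
    offset = 0
    target_logit = None
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)
    if target_logit is None:
        raise IndexError("target class is outside every shard")

    # loss = logsumexp(logits) - logits[target]
    return math.log(global_sum) + global_max - target_logit


def reference_cross_entropy(logits, target):
    """Unsharded reference used to check the sharded version."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits)) + m - logits[target]
```

The point of the sketch is that no rank needs the full logit vector: only a max, a sum, and one target logit cross shard boundaries.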

exclude embedding in MFU computation

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #280 Per suggestion in #274: This PR removes the embedding from the parameter-count calculation, because the embedding op doesn't do a matmul. This PR...

CLA Signed
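The idea of leaving the embedding out of the parameter count can be sketched as follows. The parameter names, shapes, and the `6 * N` FLOPs-per-token rule are assumptions for illustration (the latter is a common approximation for dense training, not something quoted from the PR), and `count_params` and `mfu` are hypothetical helpers.

```python
def count_params(param_shapes, exclude_prefixes=()):
    """Count parameters, skipping any whose name starts with an excluded prefix."""
    total = 0
    for name, shape in param_shapes.items():
        if name.startswith(tuple(exclude_prefixes)):
            continue  # e.g. an embedding lookup, which performs no matmul
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


def mfu(num_params, tokens_per_second, peak_flops_per_second):
    """Model FLOPs utilization, assuming roughly 6 * N FLOPs per trained token."""
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical toy model: names and shapes are made up for this example.
shapes = {
    "tok_embeddings.weight": (1000, 64),
    "layers.0.attention.wq.weight": (64, 64),
    "layers.0.feed_forward.w1.weight": (256, 64),
    "output.weight": (1000, 64),
}

all_params = count_params(shapes)
matmul_params = count_params(shapes, exclude_prefixes=("tok_embeddings",))
```

Counting the embedding would inflate the FLOPs estimate and therefore overstate MFU, which is why the smaller `matmul_params` figure is the one passed to `mfu` in this sketch.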

numerical issue when running SDPA with DTensor

The issue comes from the backward computation of `aten.mul` of two complex numbers from DTensors: the result will be `b + ai` when it should be `a + bi`. Not...

bug

help wanted
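The symptom described in this issue, real and imaginary parts trading places, can be illustrated with plain tuples. This sketch only shows what a correct complex product looks like next to its swapped counterpart; it does not reproduce the DTensor code path or its cause, and the helper names are hypothetical.

```python
def complex_mul(x, y):
    """Multiply two complex numbers given as (real, imag) tuples."""
    a, b = x
    c, d = y
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return (a * c - b * d, a * d + b * c)


def swap_parts(z):
    """Return z with real and imaginary parts exchanged (the buggy shape)."""
    real, imag = z
    return (imag, real)


def looks_swapped(actual, expected):
    """True when `actual` equals `expected` with its parts exchanged."""
    return actual != expected and actual == swap_parts(expected)


expected = complex_mul((1.0, 2.0), (3.0, 4.0))
buggy = swap_parts(expected)
```

A check like `looks_swapped` is a cheap way to tell this specific failure apart from ordinary floating-point drift when comparing against a single-device reference.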

FSDP + SP does not work with --compile

FSDP + SP works fine when compile is off, but fails with the following error when compile is on: error log SP=2 ./run_llama_train.sh + TRAINER_DIR=/home/lty/local/torchtrain + MODEL=llama + MODEL_CONF=debugmodel + NGPU=8...

bug

run sdpa with dtensor

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #180 * #285 * #161 * #172 This PR gets rid of the manual adjustment of the number of heads in attention layers,...

CLA Signed

unify data loading from HF and from disk

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #287 As titled. We can just use the `load_dataset` HF API to unify different use cases. 1. [`load_dataset`](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/loading_methods#datasets.load_dataset) is flexible in that,...

CLA Signed

register fused rmsnorm as pytorch custom op

freezing some part of the model

reload existing llama checkpoints

add config option to only produce tensorboard logs on rank 0