torchtitan

A PyTorch native library for large-scale model training

362 torchtitan issues

This PR uses shared memory to run async checkpointing on a separate process, and also implements async staging (overlapping staging with the next training iteration).

CLA Signed
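A minimal sketch of the overlap described in the PR above, assuming a spawn-based helper process and hypothetical function names (this is not the PR's actual API): the state dict is staged to CPU memory, handed to a background process via a torch.multiprocessing queue (which moves CPU tensors through shared memory), and the next training iteration proceeds while the worker writes the file.

```python
import torch
import torch.multiprocessing as mp

def _save_worker(queue: "mp.Queue") -> None:
    # Background process: drain staged state dicts and persist them to disk.
    while True:
        item = queue.get()
        if item is None:
            break
        step, state_dict = item
        torch.save(state_dict, f"checkpoint_step_{step}.pt")

def stage_checkpoint(step: int, model: torch.nn.Module, queue: "mp.Queue") -> None:
    # "Staging": snapshot parameters to CPU memory the worker can read, then
    # return immediately so the next iteration overlaps with the disk write.
    cpu_state = {name: param.detach().cpu() for name, param in model.state_dict().items()}
    queue.put((step, cpu_state))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=_save_worker, args=(queue,), daemon=True)
    worker.start()
    model = torch.nn.Linear(16, 16)      # stand-in for the real model
    for step in range(3):                # stand-in training loop
        stage_checkpoint(step, model, queue)
    queue.put(None)                      # signal shutdown, wait for pending saves
    worker.join()
```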

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #226 Unsure how to proceed with the APIs, but a few nice-to-haves are: 1) specify the load path separately from the save...

CLA Signed

The incompatibility is that during the backward pass, fused_rmsnorm does dynamic control flow over strides, which isn't safe for the export tracing used by PP. ``` dy = dy.view(-1, dy.shape[-1]) if dy.stride(-1) !=...

bug
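For illustration, the kind of data-dependent branch the issue above refers to looks roughly like this (a simplified stand-in built from the quoted snippet, not the actual fused_rmsnorm backward):

```python
import torch

def prep_grad_for_rmsnorm_backward(dy: torch.Tensor) -> torch.Tensor:
    dy = dy.view(-1, dy.shape[-1])
    # Dynamic control flow over strides: which branch runs depends on the
    # runtime memory layout of dy, so export tracing (as used by PP) cannot
    # capture both paths in a single static graph.
    if dy.stride(-1) != 1:
        dy = dy.contiguous()
    return dy
```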

We have a test in another repo; move it over and integrate it into torchtrain to ensure no future issues with any changes.

better_engineering

Adding this as a tracking issue to unblock https://github.com/pytorch/torchtrain/pull/181 from landing. Per @wanchaol: IMO we should also register the fwd/bwd rmsnorm kernels as a PyTorch op, so that:...

bug
enhancement
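A minimal sketch of what registering the kernel as a proper PyTorch op could look like, using torch.library.custom_op; the op name and the eager reference body here are assumptions, not the actual torchtrain Triton kernels.

```python
import torch
from torch.library import custom_op

@custom_op("torchtrain::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Eager reference implementation; the fused forward kernel would go here.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x * rms * weight).to(x.dtype)

@rmsnorm.register_fake
def _(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Shape/dtype propagation so torch.compile / export can trace through the
    # op without running the kernel.
    return torch.empty_like(x)

# The backward kernel would be attached similarly via rmsnorm.register_autograd(...),
# so compile/export see a single opaque op instead of tracing into Triton code.
```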

FSDP + SP works fine when compile is off, but fails with the following error when compile is on (error log, SP=2): ./run_llama_train.sh + TRAINER_DIR=/home/lty/local/torchtrain + MODEL=llama + MODEL_CONF=debugmodel + NGPU=8...

bug

Tasks:
- [ ] Add barebone MoE without expert parallelism
- [ ] Prototype expert parallel with MoE
- [ ] e2e integration and perf validation with torchtrain...

enhancement
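As a sketch of the first task above, a barebone MoE layer without expert parallelism might look like the following (the router, expert shapes, and module names are illustrative, not torchtrain's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                  # (num_tokens, dim)
        scores = F.softmax(self.router(tokens), dim=-1)      # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # route each token to top-k experts
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            token_ids, slot = (indices == expert_id).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # Weighted contribution of this expert to the tokens routed to it.
            out[token_ids] += weights[token_ids, slot, None] * expert(tokens[token_ids])
        return out.reshape(x.shape)
```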

https://github.com/pytorch/torchtrain/blob/8dd5798241490c5f532e822e9f9c1d30e0fba0df/train.py#L155-L159 Hey there! With @mathuvu we ran into a sneaky bug in our codebase that broke loss-evolution parity with xlformers: basically, we didn't seed torch before initialising the...

bug
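The fix hinted at above is to seed torch before the model is constructed, so parameter initialization is reproducible across runs; a minimal illustration (the function and model here are stand-ins):

```python
import torch
import torch.nn as nn

def build_model(seed: int = 42) -> nn.Module:
    # Seed *before* any parameters are created; seeding afterwards leaves the
    # initial weights (and therefore the loss curve) dependent on prior RNG use.
    torch.manual_seed(seed)
    return nn.Linear(4096, 4096)   # stand-in for the real model construction
```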

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #182 If we shard the embeddings as a separate FSDP parameter group, then: - In forward, we have a separate all-gather for...

CLA Signed
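For context on the trade-off in #182 above, a sketch of giving the embeddings their own FSDP unit via an auto-wrap policy (the policy choice is an illustration, not the PR's code): a separately wrapped module gets its own flat parameter and therefore its own all-gather in forward.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

def shard_with_separate_embedding(model: nn.Module) -> FSDP:
    # Every nn.Embedding becomes its own FSDP unit (its own flat parameter
    # group), so its weights are all-gathered separately from the rest of the
    # model in forward. Requires torch.distributed to be initialized.
    policy = ModuleWrapPolicy({nn.Embedding})
    return FSDP(model, auto_wrap_policy=policy)
```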