torchtitan
A PyTorch native library for large-scale model training
This PR uses shared memory to perform async checkpointing in a separate process, and also implements async staging (overlapping staging with the next training iteration).
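A minimal sketch of the general pattern, not the PR's actual implementation: copy the state dict into shared-memory tensors ("staging"), then let a background process write those tensors to disk while the trainer moves on to the next iteration. The helper names `stage_state_dict`, `checkpoint_writer`, and `async_checkpoint` are hypothetical; the PR additionally overlaps staging itself with the next iteration, whereas this sketch stages synchronously for simplicity.

```python
import torch
import torch.multiprocessing as mp


def stage_state_dict(state_dict):
    # Hypothetical helper: copy tensors to shared-memory CPU storage so a
    # background process can read them without blocking the trainer.
    staged = {}
    for name, tensor in state_dict.items():
        cpu_copy = tensor.detach().to("cpu", copy=True)
        cpu_copy.share_memory_()
        staged[name] = cpu_copy
    return staged


def checkpoint_writer(staged_state_dict, path):
    # Runs in a separate process: persist the staged tensors to disk.
    torch.save(staged_state_dict, path)


def async_checkpoint(model, path):
    staged = stage_state_dict(model.state_dict())
    # The slow disk write overlaps with the next training iteration
    # in the parent process.
    proc = mp.get_context("spawn").Process(
        target=checkpoint_writer, args=(staged, path)
    )
    proc.start()
    return proc  # caller should join() before the next checkpoint
```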
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #226

Unsure how to proceed with the APIs, but a few nice-to-haves are: 1. specify the load path separately from the save...
The incompatibility is that during the backward pass, fused_rmsnorm does dynamic control flow over strides, which isn't safe for the export tracing used by PP.

```
dy = dy.view(-1, dy.shape[-1])
if dy.stride(-1) !=...
```
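To illustrate the class of problem (illustrative only, not the kernel's actual code; the `.contiguous()` resolution below is an assumption about how the truncated branch continues): branching on a tensor's stride is layout-dependent control flow, which export-style tracing cannot fold into a single static graph.

```python
import torch


def normalize_grad_layout(dy: torch.Tensor) -> torch.Tensor:
    # Layout-dependent branch: whether it is taken depends on the runtime
    # strides of dy, so a traced/exported graph can't represent both paths.
    dy = dy.view(-1, dy.shape[-1])
    if dy.stride(-1) != 1:
        dy = dy.contiguous()
    return dy
```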
We have a test in another repo; move it and integrate it into torchtrain to ensure no future issues with any changes.
Adding this as a tracking issue to unblock https://github.com/pytorch/torchtrain/pull/181 from landing. Per @wanchaol: IMO we should also register the fwd/bwd rmsnorm kernels as a PyTorch op, so that:...
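A minimal sketch of what such a registration could look like, using `torch.library.custom_op` from recent PyTorch releases; the `titan::rmsnorm` op name, the eager reference body, and the fake impl are assumptions for illustration, and the real op would dispatch to the fused Triton forward/backward kernels instead.

```python
import torch
from torch.library import custom_op


# Hypothetical op name; the body is an eager reference, not the fused kernel.
@custom_op("titan::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight


@rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation only, so torch.compile / export can trace
    # through the op without running the kernel.
    return torch.empty_like(x)
```

The backward kernel would then be attached via `torch.library.register_autograd`, letting both compile and export treat the fused implementation as an opaque op.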
FSDP + SP works fine when compile is off, but we got the following error when compile is on.

Error log (truncated):

```
SP=2 ./run_llama_train.sh
+ TRAINER_DIR=/home/lty/local/torchtrain
+ MODEL=llama
+ MODEL_CONF=debugmodel
+ NGPU=8...
```
```[tasklist]
### Tasks
- [ ] Add barebone MoE without expert parallelism
- [ ] Prototype expert parallel with MoE
- [ ] e2e integration and perf validation with torchtrain...
```
https://github.com/pytorch/torchtrain/blob/8dd5798241490c5f532e822e9f9c1d30e0fba0df/train.py#L155-L159

Hey there! @mathuvu and I ran into a sneaky bug in our codebase that broke loss-evolution parity with xlformers: basically, we didn't seed torch before initialising the...
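For reference, the usual guard against this class of bug is to seed every RNG before any randomly initialized parameter is created; a minimal sketch (the seed value and the model constructor are placeholders):

```python
import torch

seed = 42  # placeholder value
torch.manual_seed(seed)  # seeds the CPU RNG and all CUDA devices
# Only after seeding should the randomly initialized model be built, e.g.:
# model = Transformer(model_config)
```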
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #182

If we shard the embeddings as a separate FSDP parameter group, then:
- In forward, we have a separate all-gather for...
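For context on where that separate parameter group comes from, here is a sketch of how the FSDP auto-wrap policy controls the grouping; the `Block` class is a placeholder (in torchtitan it would be the transformer block), and this is not the PR's code:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


class Block(nn.Module):
    ...  # placeholder for the transformer block


# Wrapping only the blocks keeps the embedding in the root FSDP unit,
# so it is gathered together with the other root parameters.
policy_shared = ModuleWrapPolicy({Block})

# Also wrapping nn.Embedding makes the embedding its own FSDP unit,
# which implies a separate all-gather for it in forward.
policy_separate = ModuleWrapPolicy({Block, nn.Embedding})

# model = FSDP(model, auto_wrap_policy=policy_separate)
```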