torchtitan
A PyTorch native library for large-scale model training
This PR uses shared memory to perform async checkpointing in a separate process, and also implements async staging (overlapping staging with the next training iteration).
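A minimal sketch of the general pattern, not the PR's actual implementation: copy the state dict into shared-memory tensors ("staging"), then let a background process write those tensors to disk while the trainer moves on to the next iteration. The helper names `stage_state_dict`, `checkpoint_writer`, and `async_checkpoint` are hypothetical; the PR additionally overlaps staging itself with the next iteration, whereas this sketch stages synchronously for simplicity.

```python
import torch
import torch.multiprocessing as mp


def stage_state_dict(state_dict):
    # Hypothetical helper: copy tensors to shared-memory CPU storage so a
    # background process can read them without blocking the trainer.
    staged = {}
    for name, tensor in state_dict.items():
        cpu_copy = tensor.detach().to("cpu", copy=True)
        cpu_copy.share_memory_()
        staged[name] = cpu_copy
    return staged


def checkpoint_writer(staged_state_dict, path):
    # Runs in a separate process: persist the staged tensors to disk.
    torch.save(staged_state_dict, path)


def async_checkpoint(model, path):
    staged = stage_state_dict(model.state_dict())
    # The slow disk write overlaps with the next training iteration
    # in the parent process.
    proc = mp.get_context("spawn").Process(
        target=checkpoint_writer, args=(staged, path)
    )
    proc.start()
    return proc  # caller should join() before the next checkpoint
```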
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #226

Unsure how to proceed with the APIs, but a few nice-to-haves are: 1. specify the load path separately from the save...
The incompatibility is that during the backward pass, fused_rmsnorm does dynamic control flow over strides, which isn't safe for the export tracing used by PP.

```
dy = dy.view(-1, dy.shape[-1])
if dy.stride(-1) !=...
```
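To illustrate the class of problem (illustrative only, not the kernel's actual code; the `.contiguous()` resolution below is an assumption about how the truncated branch continues): branching on a tensor's stride is layout-dependent control flow, which export-style tracing cannot fold into a single static graph.

```python
import torch


def normalize_grad_layout(dy: torch.Tensor) -> torch.Tensor:
    # Layout-dependent branch: whether it is taken depends on the runtime
    # strides of dy, so a traced/exported graph can't represent both paths.
    dy = dy.view(-1, dy.shape[-1])
    if dy.stride(-1) != 1:
        dy = dy.contiguous()
    return dy
```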
We have a test in another repo; move it and integrate it into torchtrain to ensure no future issues with any changes.
Adding this as a tracking issue to unblock https://github.com/pytorch/torchtrain/pull/181 from landing. Per @wanchaol: IMO we should also register the fwd/bwd rmsnorm kernels as a PyTorch op, so that:...
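A minimal sketch of what such a registration could look like, using `torch.library.custom_op` from recent PyTorch releases; the `titan::rmsnorm` op name, the eager reference body, and the fake impl are assumptions for illustration, and the real op would dispatch to the fused Triton forward/backward kernels instead.

```python
import torch
from torch.library import custom_op


# Hypothetical op name; the body is an eager reference, not the fused kernel.
@custom_op("titan::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight


@rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation only, so torch.compile / export can trace
    # through the op without running the kernel.
    return torch.empty_like(x)
```

The backward kernel would then be attached via `torch.library.register_autograd`, letting both compile and export treat the fused implementation as an opaque op.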
FSDP + SP works fine when compile is off, but we got the following error when compile is on.

Error log (truncated):

```
SP=2 ./run_llama_train.sh
+ TRAINER_DIR=/home/lty/local/torchtrain
+ MODEL=llama
+ MODEL_CONF=debugmodel
+ NGPU=8...
```
```[tasklist]
### Tasks
- [ ] Add barebone MoE without expert parallelism
- [ ] Prototype expert parallel with MoE
- [ ] e2e integration and perf validation with torchtrain...
```
https://github.com/pytorch/torchtrain/blob/8dd5798241490c5f532e822e9f9c1d30e0fba0df/train.py#L155-L159

Hey there! @mathuvu and I ran into a sneaky bug in our codebase that broke loss-evolution parity with xlformers: basically, we didn't seed torch before initialising the...
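For reference, the usual guard against this class of bug is to seed every RNG before any randomly initialized parameter is created; a minimal sketch (the seed value and the model constructor are placeholders):

```python
import torch

seed = 42  # placeholder value
torch.manual_seed(seed)  # seeds the CPU RNG and all CUDA devices
# Only after seeding should the randomly initialized model be built, e.g.:
# model = Transformer(model_config)
```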
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #182

If we shard the embeddings as a separate FSDP parameter group, then:
- In forward, we have a separate all-gather for...
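For context on where that separate parameter group comes from, here is a sketch of how the FSDP auto-wrap policy controls the grouping; the `Block` class is a placeholder (in torchtitan it would be the transformer block), and this is not the PR's code:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


class Block(nn.Module):
    ...  # placeholder for the transformer block


# Wrapping only the blocks keeps the embedding in the root FSDP unit,
# so it is gathered together with the other root parameters.
policy_shared = ModuleWrapPolicy({Block})

# Also wrapping nn.Embedding makes the embedding its own FSDP unit,
# which implies a separate all-gather for it in forward.
policy_separate = ModuleWrapPolicy({Block, nn.Embedding})

# model = FSDP(model, auto_wrap_policy=policy_separate)
```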