torchtitan
A PyTorch native library for large-scale model training
Can it train a big model where the GPU cannot even fit batch size = 1?
Summary: Use the stateful_dataloader from torchdata (https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader) for storing the token buffer and iteration data order. It requires a dependency on the nightly build of torchdata >= 20240426. Also make...
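The sketch below illustrates the checkpoint/resume pattern that StatefulDataLoader enables; the toy dataset and step counts are placeholders, not torchtitan's actual data pipeline.

```python
import torch
from torch.utils.data import Dataset
from torchdata.stateful_dataloader import StatefulDataLoader

class ToyDataset(Dataset):
    # Stand-in for the real tokenized dataset.
    def __len__(self):
        return 1000

    def __getitem__(self, i):
        return torch.tensor(i)

loader = StatefulDataLoader(ToyDataset(), batch_size=8, num_workers=2)

state = None
for step, batch in enumerate(loader):
    if step == 10:
        # Capture iteration state (including worker progress) mid-epoch,
        # so a resume neither replays nor skips samples.
        state = loader.state_dict()
        break

# On resume: build an identical loader and restore the saved position.
resumed = StatefulDataLoader(ToyDataset(), batch_size=8, num_workers=2)
resumed.load_state_dict(state)
next(iter(resumed))  # continues after step 10 rather than from the start
```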
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #280 Per the suggestion in #274: this PR removes the embedding from the number-of-parameters calculation, because the embedding op doesn't do a matmul. This PR...
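A minimal sketch of the adjusted count; the `tok_embeddings` attribute name is an assumption for illustration, not necessarily torchtitan's.

```python
import torch.nn as nn

def num_params_excluding_embedding(model: nn.Module, embedding_attr: str = "tok_embeddings") -> int:
    """Parameter count minus the token-embedding table.

    The embedding is a table lookup, not a matmul, so excluding it gives a
    more meaningful basis for FLOPs/MFU estimates. The attribute name is an
    assumption; adjust it to match your model.
    """
    total = sum(p.numel() for p in model.parameters())
    embedding = getattr(model, embedding_attr)
    return total - sum(p.numel() for p in embedding.parameters())
```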
This way we could temporarily enable 2-D parallel compile, and it might make sense to do per-transformer-block compile in the future with PP (we'll see). We should figure...
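A rough sketch of per-block compile, assuming the blocks live in a `model.layers` container (that layout is an assumption made for illustration):

```python
import torch
import torch.nn as nn

def compile_transformer_blocks(model: nn.Module) -> None:
    # Compile each block separately instead of the whole model, so compiled
    # regions line up with the per-block boundaries that PP splits on.
    for name, block in model.layers.named_children():
        model.layers.register_module(name, torch.compile(block))
```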
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #250
The lr scheduler currently maintains two global states to implement the full lr warmup and decay. We want to remove these: "nit: we can make these two arguments still as function...
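One hedged sketch of how the globals could be removed, passing the two step counts into the schedule via `functools.partial` (the exact schedule shape here is illustrative):

```python
import functools
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_linear_decay(warmup_steps: int, decay_steps: int, current_step: int) -> float:
    # Linear warmup to the base lr, then linear decay toward zero.
    if current_step < warmup_steps:
        return float(current_step + 1) / float(max(1, warmup_steps))
    progress = float(current_step - warmup_steps) / float(max(1, decay_steps))
    return max(0.0, 1.0 - progress)

def build_lr_scheduler(optimizer, warmup_steps: int, decay_steps: int) -> LambdaLR:
    # functools.partial bakes both step counts into the lr lambda, so no
    # module-level globals are needed.
    lr_lambda = functools.partial(linear_warmup_linear_decay, warmup_steps, decay_steps)
    return LambdaLR(optimizer, lr_lambda=lr_lambda)
```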
The issue comes from the backward computation of `aten.mul` of two complex numbers from DTensors: the result will be `b + a*i` when it should be `a + b*i`. Not...
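For reference, the correct behavior on plain (non-DTensor) complex tensors: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, and for a real-valued loss the gradient w.r.t. `x` is `conj(y)`. The bug described above is specific to the DTensor backward path.

```python
import torch

x = torch.tensor([1.0 + 2.0j], requires_grad=True)
y = torch.tensor([3.0 + 4.0j])

out = x * y  # (1+2j)(3+4j) = (1*3 - 2*4) + (1*4 + 2*3)j = -5+10j
out.real.sum().backward()

print(out)     # tensor([-5.+10.j], grad_fn=...)
print(x.grad)  # tensor([3.-4.j]) == conj(y), the expected gradient
```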
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #180 * #285 * #161 * __->__ #172 Adds a new command, ./create_seed_checkpoint.sh, which largely reuses code inside train.py to create the model and...
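The gist of a seed checkpoint, sketched with a toy model; the real script reuses train.py's model construction and checkpointing, so everything below is a stand-in.

```python
import torch
import torch.nn as nn

# Initialize the full model deterministically in a single, unparallelized
# process and save the weights, so runs with any parallelism layout can load
# identical starting parameters. The tiny model and path are placeholders.
torch.manual_seed(42)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
torch.save(model.state_dict(), "step-0-seed.pt")

# Each later training run loads the same seed weights before sharding.
restored = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
restored.load_state_dict(torch.load("step-0-seed.pt"))
```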
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #308 * __->__ #161 ---- - uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes...
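A sketch of the tracer-frontend idea using the `torch.distributed.pipelining` API available in recent PyTorch (the PR itself used the earlier PiPPy interface; the model and split point below are placeholders):

```python
import torch
import torch.nn as nn
from torch.distributed.pipelining import pipeline, SplitPoint

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])

    def forward(self, x):
        return self.layers(x)

model = Toy()
x = torch.randn(8, 16)  # one microbatch of example inputs for tracing

# Trace the model into a graph and split it into two stages at the start of
# layers.2; each stage's submodule can then be placed on its own rank.
pipe = pipeline(model, mb_args=(x,), split_spec={"layers.2": SplitPoint.BEGINNING})
print(pipe.num_stages)            # 2
stage0 = pipe.get_stage_module(0) # submodule for the first stage
```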
Purpose of this PR is to show: 1. One line change needed -- remove this line:
```
self.freqs_cis = self.freqs_cis.to(h.device)
```
Reason 1: compile does not support in-place attribute mutation....
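One common way to make that removal safe is to register `freqs_cis` as a non-persistent buffer, so it follows the module across devices with no attribute mutation in forward(). The toy module below sketches that pattern; the dimensions and rotary math are illustrative, not torchtitan's exact code.

```python
import torch
import torch.nn as nn

class ToyTransformer(nn.Module):
    def __init__(self, dim: int = 64, max_seq_len: int = 256):
        super().__init__()
        # Precompute rotary frequencies once; registering them as a
        # non-persistent buffer means model.to(device) moves them along with
        # the parameters, so forward() needs no attribute mutation.
        inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
        t = torch.arange(max_seq_len).float()
        freqs_cis = torch.polar(torch.ones(max_seq_len, dim // 2), torch.outer(t, inv_freq))
        self.register_buffer("freqs_cis", freqs_cis, persistent=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seqlen, dim). self.freqs_cis is already on h's device,
        # so the removed line is unnecessary here.
        seqlen = h.size(1)
        hc = torch.view_as_complex(h.float().reshape(*h.shape[:-1], -1, 2))
        return torch.view_as_real(hc * self.freqs_cis[:seqlen]).flatten(2)
```

With this, `ToyTransformer().to("cuda")` places the buffer on the GPU together with the parameters, and torch.compile sees no in-place attribute mutation.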