Jiani Wang

Results 37 comments of Jiani Wang

NOTE: Not for review yet; I will test locally.

> oh I think the bug was introduced here -- now with wrong indentation https://github.com/pytorch/torchtitan/pull/1776/files#diff-83b7868cc3b5fde38ae75ccd8346675495ed27207bc75c422cf8c2ef4d8096d3L210-L218

Can you elaborate more on this? Why does this cause the memory usage increase?

> what's the issue between compile + SAC + MoE?

SAC will wrap each submodule of TransformerBlock separately ([_apply_op_sac_to_transformer_block_with_flex](https://github.com/pytorch/torchtitan/blob/refs/heads/main/torchtitan/distributed/activation_checkpoint.py#L158)), which will make each submodule of TransformerBlock an instance of CheckpointWrapper....
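As a rough illustration of that wrapping (a minimal sketch, not torchtitan's actual `_apply_op_sac_to_transformer_block_with_flex`; `ToyTransformerBlock` and its submodule names are made up for this example):

```python
# Sketch only: per-submodule SAC wrapping turns each child of a block into a
# CheckpointWrapper, so compile sees wrapper boundaries at every child instead
# of one flat module.
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointWrapper,
    checkpoint_wrapper,
)


class ToyTransformerBlock(nn.Module):
    """Hypothetical stand-in for a TransformerBlock with a few submodules."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.attention = nn.Linear(dim, dim)
        self.feed_forward = nn.Linear(dim, dim)

    def forward(self, x):
        return self.feed_forward(self.attention(x))


block = ToyTransformerBlock()

# Wrap each submodule separately, analogous to what the op-level SAC helper does.
for name, child in list(block.named_children()):
    block.register_module(name, checkpoint_wrapper(child))

# Every submodule is now a CheckpointWrapper instance.
print(all(isinstance(m, CheckpointWrapper) for m in block.children()))  # True
```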

To check my understanding:

> If you're only compiling a single op like FlexAttention, it is fine to not be able to see into the graph.

So if only FlexAttn...
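For reference, compiling only the FlexAttention op (rather than the whole block) looks roughly like this; a minimal sketch assuming a recent PyTorch with the `torch.nn.attention.flex_attention` API and a CUDA device, with arbitrary example shapes:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Compile just the single op; the surrounding model stays eager, so compile
# never needs to trace through CheckpointWrapper boundaries.
compiled_flex_attention = torch.compile(flex_attention)

# (batch, heads, seq_len, head_dim) -- arbitrary example shapes
q = torch.randn(2, 4, 128, 64, device="cuda")
k = torch.randn(2, 4, 128, 64, device="cuda")
v = torch.randn(2, 4, 128, 64, device="cuda")

out = compiled_flex_attention(q, k, v)
```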

Thanks for asking! Currently CP is not officially supported for Qwen3, and we haven't implemented and tested CP on Qwen3. This is because of the RoPE embedding differences (In Qwen3,...

> > Removing the version.txt file will break any builds intending to use the PyTorch Dev Infra build system and since this isn't urgent, I'd like to not merge quite...