
Pipeline Parallelism for PyTorch

123 PiPPy issues

https://github.com/pytorch/torchtitan/pull/161/files#diff-80b04fce2b861d9470c6160853441793678ca13904dae2a9b8b7145f29cd017aR269 IIRC @awgu mentioned there was an issue requiring this setting for the time being. Not sure why, or whether it has been fixed yet.

Currently have to work around this by using the regular `rmsnorm` for PP to be enabled:

```
torch._dynamo.exc.Unsupported: Illegal getattr invocation stride in strict mode
# coming from `if dy.stride(-1) != ...`
```
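
For reference, a minimal non-fused RMSNorm with no `stride()` introspection — the kind of "regular `rmsnorm`" the workaround swaps in. This is a sketch, not the torchtitan code; the shape handling and `eps` value are assumptions:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain rmsnorm: no data-dependent stride() checks in forward or
    backward, so strict-mode tracing has nothing illegal to trip over."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```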

`torch.export` has a strict mode and a non-strict mode; for the difference, please read [Non-Strict Export](https://pytorch.org/docs/stable/export.html#non-strict-export). This PR switches to non-strict mode by default, improving the tracing success rate (no Dynamo graph breaks).
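
For illustration, a minimal sketch of selecting the mode via the `strict` flag on `torch.export.export`; the toy module and inputs are assumptions:

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

mod, args = Toy(), (torch.randn(4),)

# Strict mode traces with TorchDynamo, which can fail on unsupported
# Python constructs (graph breaks).
ep_strict = torch.export.export(mod, args, strict=True)

# Non-strict mode traces at the Python interpreter level instead, which
# is what this PR makes the default to improve tracing success.
ep_nonstrict = torch.export.export(mod, args, strict=False)
```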

https://github.com/pytorch/torchtitan/pull/161/files#diff-80b04fce2b861d9470c6160853441793678ca13904dae2a9b8b7145f29cd017aR254 In principle, the issue is that PP traced the non-FSDP model, and in that case the model code ran a `.to(f32)` operation which was a no-op...
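
To illustrate why the cast disappears at trace time, a small standalone sketch (plain PyTorch, not the torchtitan code):

```python
import torch

x32 = torch.randn(4, dtype=torch.float32)
# On an fp32 tensor, .to(torch.float32) returns the same tensor, so a
# trace of the fp32 (non-FSDP) model records no cast at all...
print(x32.to(torch.float32) is x32)  # True

x16 = x32.to(torch.bfloat16)
# ...but under FSDP mixed precision the activation may arrive in bf16,
# where the same line is a real cast the traced graph no longer performs.
print(x16.to(torch.float32).dtype)  # torch.float32
```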

Add a docstring for the manual stage and an example under `basic/`. Made `input_args` a required argument.

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #1109
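
For context, a hypothetical sketch of passing `input_args` to a manual stage. Only the `input_args` name comes from the snippet above; the class name, import path, and the other parameters are assumptions, not the confirmed API:

```python
import torch
from pippy import ManualPipelineStage  # hypothetical import path

stage_module = torch.nn.Linear(512, 512)  # the submodule this rank owns

stage = ManualPipelineStage(
    stage_module,
    stage_index=0,                # assumed parameter names
    num_stages=4,
    device=torch.device("cuda"),
    # input_args is now required: example inputs used to infer the
    # shapes/dtypes of the tensors this stage receives.
    input_args=(torch.randn(8, 512, device="cuda"),),
)
```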

**1D**
- #1108

**2D (FSDP)**
- #1104
- #1105

**3D (TP)**

An automatic graph-based pipeline splitting algorithm. The goal of the method is to split the computation graph into stages to minimize the communication between the stages while trying to balance...
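
As a toy illustration of that objective — not PiPPy's actual algorithm — a greedy splitter over a linear chain of per-node costs. Contiguous stages confine communication to stage boundaries, and the running target balances compute:

```python
def split_chain(costs, num_stages):
    """Partition nodes 0..n-1 into contiguous stages of roughly equal
    total cost (an illustrative sketch, not the real splitting method)."""
    target = sum(costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        nodes_left = len(costs) - i - 1
        stages_left = num_stages - len(stages) - 1
        # Close the stage once we reach the target, but keep at least
        # one node available for each remaining stage.
        if acc >= target and stages_left > 0 and nodes_left >= stages_left:
            stages.append(current)
            current, acc = [], 0.0
    if current:
        stages.append(current)
    return stages

# e.g. [[0, 1, 2], [3, 4], [5, 6]]: three roughly equal-cost stages
print(split_chain([1, 1, 4, 2, 2, 1, 1], num_stages=3))
```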

Once we train using the `Pipe` object and the GPipe scheduler, how can we get the trained model back as a normal `nn.Module`?

- [x] FSDP + PP
- [ ] DDP + PP
- [ ] DCP path