Add Pipeline Parallel (and 2D PP+FSDP) support

Open wconstab opened this issue 1 year ago • 0 comments

Stack from ghstack (oldest at bottom):

  • #340
  • #337
  • -> #318

Runs PP+DP and PP+TP without issue. Runs PP+TP+DP with decreasing loss, but fails on DCP save.

Currently supports only the simple schedules, GPipe and 1F1B.
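For readers unfamiliar with the two schedules, here is a rough sketch of the order in which a single pipeline stage would process microbatches under each one. This is not the code in this PR: `gpipe_order` and `one_f_one_b_order` are hypothetical helpers, the 1F1B warmup count follows the commonly described non-interleaved rule (`num_stages - stage_idx - 1`), and communication and timing are ignored.

```python
def gpipe_order(num_microbatches: int) -> list[str]:
    """GPipe: run every forward first, then every backward."""
    forwards = [f"F{i}" for i in range(num_microbatches)]
    backwards = [f"B{i}" for i in range(num_microbatches)]
    return forwards + backwards


def one_f_one_b_order(stage_idx: int, num_stages: int, num_microbatches: int) -> list[str]:
    """1F1B: a few warmup forwards, then alternate one forward / one backward,
    then drain the remaining backwards (assumed non-interleaved variant)."""
    warmup = min(num_stages - stage_idx - 1, num_microbatches)
    steady = num_microbatches - warmup

    order = [f"F{i}" for i in range(warmup)]  # warmup forwards
    fwd, bwd = warmup, 0
    for _ in range(steady):  # steady state: one forward, then one backward
        order.append(f"F{fwd}")
        order.append(f"B{bwd}")
        fwd += 1
        bwd += 1
    while bwd < num_microbatches:  # cooldown: remaining backwards
        order.append(f"B{bwd}")
        bwd += 1
    return order


if __name__ == "__main__":
    print(gpipe_order(3))
    print(one_f_one_b_order(stage_idx=0, num_stages=2, num_microbatches=3))
    print(one_f_one_b_order(stage_idx=1, num_stages=2, num_microbatches=3))
```

The practical difference is peak activation memory: under GPipe a stage holds activations for all microbatches before any backward runs, while under 1F1B it holds at most `warmup + 1` at a time.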

Adds a cmdline/toml arg for specifying split points, unified across the tracer and manual frontends.

e.g. the user can specify "layers.2,layers.4" as split points.
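As an illustration of how a split-point string like the one above could map modules to pipeline stages, here is a minimal sketch. It is not the implementation in this PR: `parse_split_points` and `assign_stages` are hypothetical helpers, the module names are made up for the example, and it assumes each split point names the first module of a new stage.

```python
def parse_split_points(spec: str) -> list[str]:
    """Turn 'layers.2,layers.4' into ['layers.2', 'layers.4']."""
    return [part.strip() for part in spec.split(",") if part.strip()]


def assign_stages(module_names: list[str], split_points: list[str]) -> list[list[str]]:
    """Group ordered module names into stages.

    Assumption: each split point is the first module of a new stage.
    """
    unknown = [p for p in split_points if p not in module_names]
    if unknown:
        raise ValueError(f"unknown split points: {unknown}")

    boundaries = set(split_points)
    stages: list[list[str]] = [[]]
    for name in module_names:
        # Start a new stage at each boundary, unless the current stage is still empty.
        if name in boundaries and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


if __name__ == "__main__":
    # Hypothetical module layout for a 6-layer transformer.
    modules = ["tok_embeddings"] + [f"layers.{i}" for i in range(6)] + ["norm", "output"]
    for idx, stage in enumerate(assign_stages(modules, parse_split_points("layers.2,layers.4"))):
        print(idx, stage)
```

With two split points the model is cut into three stages, so the number of pipeline ranks would need to match the number of split points plus one under this assumption.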

Currently uses the manual frontend by default, but allows specifying the tracer frontend. The tracer frontend has additional compatibility limitations that must be worked around (they are flagged by raised assertions), and it is not ready for wider use yet.

wconstab, May 09 '24 01:05