torchtitan
Add a 3-stage PP config
Stack from ghstack (oldest at bottom):
- -> #345
- #344
- #354
Pipelining is unique in that there is no need to stick to power-of-2 numbers of stages, and there may be reasons an odd number is optimal, depending on how you divide up your cluster.
Anyway, I use this to validate the 1F1B schedule in a setup that is slightly more complicated than 2-stage but simpler than 4-stage.
It seems to run fine when run with an even batch size (--training.batch_size 12).
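For reference, here is a minimal sketch of the textbook 1F1B ordering on 3 stages, to show why it is a bit more interesting than the 2-stage case. This is not torchtitan's or PyTorch's schedule implementation: the function name `one_f_one_b_order` is hypothetical, and the microbatch count of 4 is only an illustrative assumption.

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """Return the ("F" | "B", microbatch) actions one stage runs under 1F1B.

    Hypothetical helper for illustration only; it models the ordering of
    forward (F) and backward (B) steps, not communication or timing.
    """
    # Warmup: earlier stages run extra forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    actions = [("F", mb) for mb in range(warmup)]
    fwd, bwd = warmup, 0

    # Steady state: alternate one forward with one backward.
    while fwd < num_microbatches:
        actions.append(("F", fwd))
        fwd += 1
        actions.append(("B", bwd))
        bwd += 1

    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        actions.append(("B", bwd))
        bwd += 1
    return actions


if __name__ == "__main__":
    # 3 stages, 4 microbatches (assumed values for the example).
    for stage in range(3):
        order = one_f_one_b_order(stage, num_stages=3, num_microbatches=4)
        print(f"stage {stage}:", " ".join(f"{kind}{mb}" for kind, mb in order))
```

With 3 stages, each stage gets a different warmup depth (2, 1, and 0 forwards), so the middle stage exercises a path that a 2-stage setup never hits.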