zero-bubble-pipeline-parallelism
Support sequence parallel on main branch
Lazily computing the partial gradients of the weights with the aid of a queue is really smart! @ufotalent
However, I don't believe you need to support sequence parallelism: it does nothing useful to reduce the total number of tokens processed on a single machine, and brings only small improvements for LayerNorm and dropout.
Context parallelism is much preferred.
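
For anyone reading along, here is a rough sketch of what I mean by deferring the weight gradients through a queue. It is only a toy in plain Python, not the repository's code: `DeferredLinear`, `backward_input`, `backward_weight`, and `flush` are names I made up for illustration, and I am assuming microbatches run backward in the same order they ran forward (FIFO).

```python
from collections import deque


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class DeferredLinear:
    """Toy linear layer y = x @ W that splits backward into two passes.

    Hypothetical illustration only: the input gradient is computed right
    away, while the weight gradient is queued and computed later.
    """

    def __init__(self, weight):
        self.weight = weight
        self.weight_grad = [[0.0] * len(weight[0]) for _ in weight]
        self._saved_inputs = deque()  # inputs kept from forward, one per microbatch
        self._pending = deque()       # (input, grad_output) pairs awaiting the weight pass

    def forward(self, x):
        self._saved_inputs.append(x)
        return matmul(x, self.weight)

    def backward_input(self, grad_out):
        # Only the gradient the previous stage is waiting for is computed now.
        x = self._saved_inputs.popleft()
        self._pending.append((x, grad_out))
        return matmul(grad_out, transpose(self.weight))

    def backward_weight(self):
        # Weight gradient for the oldest queued microbatch: x^T @ grad_out.
        x, grad_out = self._pending.popleft()
        delta = matmul(transpose(x), grad_out)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                self.weight_grad[i][j] += value

    def flush(self):
        # Drain everything still queued, e.g. before the optimizer step.
        while self._pending:
            self.backward_weight()
```

The point is that `backward_input` sits on the critical path between pipeline stages, while `backward_weight` has no downstream consumer until the optimizer step, so it can be scheduled wherever a stage would otherwise be idle.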