zero-bubble-pipeline-parallelism
[QUESTION] Whether to split bw when send_backward_recv_forward is not enabled
Hi, I really appreciate your work. I have a question about the ZBH1 mode.
This is one part of your code:
```python
# For BWF pattern or in rank 0, we don't split W and B for reasons below.
# 1. to leverage batched p2p op (send_backward_recv_forward)
# 2. to overlap grad all-reduce for tensor parallel
# 3. to avoid redoing grad all-gather for sequence parallel
# Note that the order of grad accumulation is changed by this behavior,
# thus causing a minor precision error compared to 1F1B even though it's mathematically correct.
WeightGradStore.split_bw = (i < rank or last_iteration) and rank > 0
```
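For context on reason 1: Megatron-style schedules fuse the backward-send and forward-receive into a single batched p2p call so the two transfers proceed together, which is only possible when B and W stay fused. Roughly like this with `torch.distributed` (a simplified sketch with assumed parameter names, not the repo's actual code):

```python
import torch
import torch.distributed as dist

def send_backward_recv_forward(input_grad, prev_rank, recv_shape, device):
    """Batch the grad send (to the previous stage) with the activation
    recv (from the previous stage) so both transfers overlap."""
    forward_input = torch.empty(recv_shape, device=device)
    ops = [
        dist.P2POp(dist.isend, input_grad, prev_rank),
        dist.P2POp(dist.irecv, forward_input, prev_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return forward_input
```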
You said that there is no need to split B and W for the BWF pattern.
My question is: if we do not enable send_backward_recv_forward, is it better to split B and W? A finer granularity should give a smaller bubble, shouldn't it?
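To illustrate the split I mean: B computes the input gradient immediately so it can be sent to the previous stage, while W (the weight gradient) is queued and run later to fill bubbles. Here is a minimal sketch of that idea for a linear layer `y = x @ weight.T`, using hypothetical helper names rather than the repo's actual WeightGradStore API:

```python
import torch

# Deferred weight-gradient (W) computations, in the spirit of WeightGradStore.
weight_grad_queue = []

def backward_b(weight, grad_output):
    """B: compute the input gradient right away so it can be sent upstream."""
    return grad_output @ weight

def defer_backward_w(x, weight, grad_output):
    """W: queue the weight-gradient computation instead of running it now."""
    weight_grad_queue.append((x, weight, grad_output))

def flush_backward_w():
    """Run the deferred W computations later, e.g. inside a pipeline bubble."""
    while weight_grad_queue:
        x, weight, grad_output = weight_grad_queue.pop(0)
        grad_w = grad_output.t() @ x  # dL/dW for y = x @ weight.T
        weight.grad = grad_w if weight.grad is None else weight.grad + grad_w
```

Since flush_backward_w() runs the W steps in a different order than 1F1B would, this also matches the precision note in the quoted comment.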