terapipe
Implement embedding+softmax on a separate GPU
Makes the rank 0 GPU responsible for computing the embedding and the softmax. Both ends of the pipeline now connect to the rank 0 GPU. A minimal sketch of this communication pattern is shown below.
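The following is a minimal sketch of the rank-0 stage under the assumptions that the pipeline uses `torch.distributed` point-to-point sends/recvs and that ranks 1 through N-1 hold the transformer slices; the function and variable names here are illustrative, not the PR's actual code:

```python
# Sketch: rank 0 owns embedding + softmax; both pipeline ends attach to it.
# Assumes dist.init_process_group() has already been called elsewhere.
import torch
import torch.distributed as dist

def embedding_softmax_stage(input_ids, embedding, lm_head, world_size):
    """Rank 0: embed the tokens, feed the first pipeline stage, receive
    the final hidden states from the last stage, and apply softmax."""
    hidden = embedding(input_ids)                # (batch, seq, hidden)
    dist.send(hidden, dst=1)                     # into the first slice
    final_hidden = torch.empty_like(hidden)
    dist.recv(final_hidden, src=world_size - 1)  # back from the last slice
    logits = lm_head(final_hidden)               # (batch, seq, vocab)
    return torch.softmax(logits, dim=-1)
```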
For GPT3-3hm, 8 slices, mixed precision, all GPUs on the same node (multi-node is also supported):
- with embedding: 1.476 s/step (5 GPUs: 2x Megatron, 2x pipeline, 1x embedding)
- without embedding: 0.305 s/step (4 GPUs: 2x Megatron, 2x pipeline)
There seems to be a bug or OOM issue with sequence lengths of 2048 or larger and embedding sizes of 1152 or larger; see the comment in the PR for details. Run with `--use-embedding` to enable the embedding stage.
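For example, a run might look like this (the script name and any flags other than `--use-embedding` are placeholders, not taken from the PR):

```bash
python train.py --use-embedding
```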
Given what happened on the data-parallel side, we are deprioritizing this PR for now and will come back to it later.