Implement embedding+softmax on a separate GPU

Open · sguo35 opened this issue 4 years ago · 1 comment

This PR makes the rank 0 GPU responsible for computing the embedding and the softmax. Both ends of the pipeline now connect to the rank 0 GPU.
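
To make the layout concrete, here is a minimal, forward-only sketch (not TeraPipe's actual code): rank 0 computes the embedding at the front and the softmax/loss at the back, while every other rank runs one transformer slice. The module choices, shapes, single-layer-per-rank split, and launch details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


def run_stage(rank, world_size, vocab_size=50257, hidden=1024, seq_len=1024, batch=1):
    """Forward-only demo: rank 0 owns embedding + softmax; ranks 1..N-1 own transformer slices."""
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"
    with torch.no_grad():
        if rank == 0:
            # Rank 0 holds both ends of the pipeline: the input embedding and the output head.
            embedding = nn.Embedding(vocab_size, hidden).to(device)
            lm_head = nn.Linear(hidden, vocab_size).to(device)

            tokens = torch.randint(vocab_size, (batch, seq_len), device=device)
            # Front end: embed the tokens and send the activations to the first transformer stage.
            dist.send(embedding(tokens), dst=1)

            # Back end: receive the last stage's output and compute the softmax / loss here.
            final_states = torch.empty(batch, seq_len, hidden, device=device)
            dist.recv(final_states, src=world_size - 1)
            logits = lm_head(final_states)
            loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1))
            print(f"loss = {loss.item():.4f}")
        else:
            # Each remaining rank holds one pipeline slice (a single layer here for brevity).
            layer = nn.TransformerEncoderLayer(hidden, nhead=16, batch_first=True).to(device)
            states = torch.empty(batch, seq_len, hidden, device=device)
            dist.recv(states, src=rank - 1)
            states = layer(states)
            # The last slice sends back to rank 0, so both pipeline ends connect to rank 0.
            dist.send(states, dst=0 if rank == world_size - 1 else rank + 1)


if __name__ == "__main__":
    # Assumes a torchrun-style launch, e.g. `torchrun --nproc_per_node=5 sketch.py`.
    dist.init_process_group("nccl")
    run_stage(dist.get_rank(), dist.get_world_size())
```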

For GPT3-3hm, 8 slices, mixed precision, with all GPUs on the same node (running across different nodes is also supported):

  • embedding: 1.476 s/step (5 GPUs: 2x Megatron, 2x pipeline, 1x embedding)
  • no embedding: 0.305 s/step (4 GPUs: 2x Megatron, 2x pipeline)

There seems to be a bug or OOM issue with sequence lengths of 2048 or larger and embedding sizes of 1152 or larger; see the comment in the PR for details. Run with --use-embedding to enable embeddings.

sguo35 · Nov 19 '20 07:11

Given what happened on the data-parallel side, we will deprioritize this PR for now and get back to it later.

zhuohan123 · Nov 25 '20 23:11