
Add tensor parallelism for attention QKVO

linden-li opened this pull request 2 years ago · 3 comments

This PR implements tensor parallelism using PyTorch's new DTensor library. It can be used either standalone or in a 2D-parallel fashion alongside other parallelism strategies such as FSDP. It partitions the QKV projections and output projections Megatron-style within a GPU node, requiring only one synchronization step at the end.
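A minimal sketch of what this Megatron-style plan looks like with PyTorch's tensor-parallel API, assuming a recent PyTorch and a process group already initialized (e.g. via torchrun). The module and attribute names (`Attention`, `Wqkv`, `out_proj`) are illustrative stand-ins, not necessarily what this PR uses, and the sketch glosses over head/bookkeeping details for the fused QKV layout:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Attention(nn.Module):
    """Toy attention block: fused QKV projection plus output projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.Wqkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)


# 1D device mesh over the GPUs in one node (here, 8 GPUs).
tp_mesh = init_device_mesh('cuda', (8,))

attn = Attention(d_model=4096)

# Megatron-style plan: shard the QKV projection column-wise and the output
# projection row-wise, so the only cross-GPU communication is the all-reduce
# after out_proj.
attn = parallelize_module(
    attn,
    tp_mesh,
    {'Wqkv': ColwiseParallel(), 'out_proj': RowwiseParallel()},
)
```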

The implementation currently supports multihead and grouped query attention. I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would appreciate advice on this.

To turn on tensor parallelism, the attention config now has two new variables:

  1. tensor_parallel_qkvo: a boolean flag that enables TP for the attention QKV and output projections
  2. tp_world_size: the number of GPUs to tensor parallelize over
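For example, a sketch of how the attention config might look with the new keys (only `tensor_parallel_qkvo` and `tp_world_size` come from this PR; everything else is illustrative):

```python
attn_config = {
    # ... existing attention options ...
    'tensor_parallel_qkvo': True,  # turn on TP for the QKV and output projections
    'tp_world_size': 8,            # number of GPUs to shard the projections across
}
```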

linden-li avatar Oct 24 '23 03:10 linden-li

@linden-li can you please add tests?

mvpatel2000 avatar Oct 24 '23 20:10 mvpatel2000

> The implementation currently supports multihead and grouped query attention. I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would appreciate advice on this.

@sashaDoubov do you still have the implementation we designed for TP-ing the ALiBi bias with DeepSpeed?

vchiley avatar Dec 05 '23 18:12 vchiley

This needs more thorough testing, but this is what we had for chunking the ALiBi bias: https://github.com/mosaicml/llm-foundry/compare/main...sashaDoubov:llm-foundry:benchmarking

sashaDoubov avatar Dec 05 '23 19:12 sashaDoubov
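For context, a minimal sketch of the chunking idea under head-sharded TP (an assumption about the approach, not taken from the linked diff): since the ALiBi bias is per-head, each TP rank can build the slopes only for the heads it owns. This assumes heads are sharded contiguously across ranks and that `n_heads` is a power of two:

```python
import torch


def alibi_slopes(n_heads: int) -> torch.Tensor:
    """Standard ALiBi slopes, one per head (power-of-two n_heads case)."""
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])


def local_alibi_bias(n_heads: int, seq_len: int, tp_rank: int,
                     tp_world_size: int) -> torch.Tensor:
    """Build the ALiBi bias only for the heads owned by this TP rank.

    Assumes heads are split contiguously, matching a column-parallel
    shard of the QKV projection.
    """
    heads_per_rank = n_heads // tp_world_size
    lo = tp_rank * heads_per_rank
    slopes = alibi_slopes(n_heads)[lo:lo + heads_per_rank]
    # Causal-style relative distances, shape (1, 1, 1, seq_len); the result
    # broadcasts against this rank's attention scores of shape
    # (batch, heads_per_rank, seq_len, seq_len).
    distances = torch.arange(1 - seq_len, 1, dtype=torch.float32).view(1, 1, 1, seq_len)
    return slopes.view(1, heads_per_rank, 1, 1) * distances
```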