llm-foundry
Add tensor parallelism for attention QKVO
This PR implements tensor parallelism using PyTorch's new DTensor library. It can be used standalone or composed in 2D-parallel fashion with other parallelism strategies such as FSDP. The QKV and output projections are partitioned Megatron-style within a GPU node, requiring a single synchronization step at the end of the attention block.
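For reference, here is a minimal sketch (not the code in this PR) of what Megatron-style column/row partitioning of the attention projections looks like with the DTensor tensor-parallel API. The module names (`Wqkv`, `out_proj`) and the helper `tp_parallelize_attention` are illustrative:

```python
# Minimal sketch (assumed names, to be launched under torchrun): Megatron-style
# TP over an attention block's fused QKV and output projections using the
# PyTorch DTensor tensor-parallel API.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Attention(nn.Module):
    """Toy attention block with fused QKV and output projections."""

    def __init__(self, d_model: int):
        super().__init__()
        self.Wqkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)


def tp_parallelize_attention(attn: Attention, tp_world_size: int) -> nn.Module:
    # One mesh dimension spanning the GPUs inside a node.
    mesh = init_device_mesh('cuda', (tp_world_size,))
    # Column-parallel QKV, row-parallel output projection: the row-parallel
    # matmul produces partial sums, so the only cross-GPU synchronization is
    # the all-reduce after out_proj.
    return parallelize_module(
        attn,
        mesh,
        {
            'Wqkv': ColwiseParallel(),
            'out_proj': RowwiseParallel(),
        },
    )
```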
The implementation currently supports multihead and grouped-query attention. I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would welcome advice on this.
To turn on tensor parallelism, the attention config now has two new variables:
- `tensor_parallel_qkvo`: a boolean flag that configures TP
- `tp_world_size`: the number of GPUs to tensor parallelize over
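For illustration, the two new fields might be set like this (the surrounding keys and values are assumptions, not the exact foundry schema):

```python
# Illustrative only: how the two new fields might appear in an attn_config dict.
attn_config = {
    'attn_type': 'grouped_query_attention',
    'tensor_parallel_qkvo': True,  # shard the QKV/output projections
    'tp_world_size': 4,            # number of GPUs to tensor parallelize over
}
```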
@linden-li can you please add tests?
> I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would welcome advice on this.
@sashaDoubov do you still have that implementation we designed for TP-ing the ALiBi bias with DeepSpeed?
This needs more thorough testing, but here is the diff we had for chunking the ALiBi bias: https://github.com/mosaicml/llm-foundry/compare/main...sashaDoubov:llm-foundry:benchmarking
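For context, a rough sketch of the chunking idea, assuming heads are split contiguously across TP ranks so each rank only materializes the ALiBi slopes for its local heads (function and argument names are hypothetical, and the power-of-two slope correction used by the real ALiBi code is omitted):

```python
# Rough sketch, not the linked diff: each TP rank builds the ALiBi bias only
# for its local, contiguous slice of heads.
import torch


def alibi_slopes(n_heads: int, alibi_bias_max: float = 8.0) -> torch.Tensor:
    # Simplified geometric sequence of per-head slopes.
    exponents = torch.arange(1, n_heads + 1, dtype=torch.float32)
    return 1.0 / torch.pow(2.0, exponents * (alibi_bias_max / n_heads))


def local_alibi_bias(n_heads: int, seq_len: int, tp_rank: int,
                     tp_world_size: int) -> torch.Tensor:
    # Slice out the slopes for the heads owned by this TP rank.
    assert n_heads % tp_world_size == 0
    heads_per_rank = n_heads // tp_world_size
    start = tp_rank * heads_per_rank
    slopes = alibi_slopes(n_heads)[start:start + heads_per_rank]
    # Relative-position term, broadcastable to (local_heads, seq_len, seq_len).
    distances = torch.arange(1 - seq_len, 1, dtype=torch.float32)
    return slopes.view(-1, 1, 1) * distances.view(1, 1, seq_len)
```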