llm-foundry
Add tensor parallelism for attention QKVO
This PR implements tensor parallelism using PyTorch's new DTensor library. It can be used standalone or composed in 2D-parallel fashion with other parallelism strategies such as FSDP. The QKV and output projections are partitioned Megatron-style within a GPU node, requiring a single synchronization step at the end of the attention block.
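For reference, here is a minimal sketch (not the code in this PR) of what Megatron-style column/row partitioning of the attention projections looks like with the DTensor tensor-parallel API. The module names (`Wqkv`, `out_proj`) and the helper `tp_parallelize_attention` are illustrative:

```python
# Minimal sketch (assumed names, to be launched under torchrun): Megatron-style
# TP over an attention block's fused QKV and output projections using the
# PyTorch DTensor tensor-parallel API.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Attention(nn.Module):
    """Toy attention block with fused QKV and output projections."""

    def __init__(self, d_model: int):
        super().__init__()
        self.Wqkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)


def tp_parallelize_attention(attn: Attention, tp_world_size: int) -> nn.Module:
    # One mesh dimension spanning the GPUs inside a node.
    mesh = init_device_mesh('cuda', (tp_world_size,))
    # Column-parallel QKV, row-parallel output projection: the row-parallel
    # matmul produces partial sums, so the only cross-GPU synchronization is
    # the all-reduce after out_proj.
    return parallelize_module(
        attn,
        mesh,
        {
            'Wqkv': ColwiseParallel(),
            'out_proj': RowwiseParallel(),
        },
    )
```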
The implementation currently supports multihead and grouped-query attention. I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would welcome advice on this.
To turn on tensor parallelism, the attention config now has two new variables:
- `tensor_parallel_qkvo`: a boolean flag that configures TP
- `tp_world_size`: the number of GPUs to tensor parallelize over
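For illustration, the two new fields might be set like this (the surrounding keys and values are assumptions, not the exact foundry schema):

```python
# Illustrative only: how the two new fields might appear in an attn_config dict.
attn_config = {
    'attn_type': 'grouped_query_attention',
    'tensor_parallel_qkvo': True,  # shard the QKV/output projections
    'tp_world_size': 4,            # number of GPUs to tensor parallelize over
}
```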
@linden-li can you please add tests?
> I was not able to find a good way to parallelize the attention bias with ALiBi in this setting and would welcome advice on this.
@sashaDoubov do you still have that implementation we designed for TP-ing the ALiBi bias with DeepSpeed?
This needs more thorough testing, but here is the diff we had for chunking the ALiBi bias: https://github.com/mosaicml/llm-foundry/compare/main...sashaDoubov:llm-foundry:benchmarking
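For context, a rough sketch of the chunking idea, assuming heads are split contiguously across TP ranks so each rank only materializes the ALiBi slopes for its local heads (function and argument names are hypothetical, and the power-of-two slope correction used by the real ALiBi code is omitted):

```python
# Rough sketch, not the linked diff: each TP rank builds the ALiBi bias only
# for its local, contiguous slice of heads.
import torch


def alibi_slopes(n_heads: int, alibi_bias_max: float = 8.0) -> torch.Tensor:
    # Simplified geometric sequence of per-head slopes.
    exponents = torch.arange(1, n_heads + 1, dtype=torch.float32)
    return 1.0 / torch.pow(2.0, exponents * (alibi_bias_max / n_heads))


def local_alibi_bias(n_heads: int, seq_len: int, tp_rank: int,
                     tp_world_size: int) -> torch.Tensor:
    # Slice out the slopes for the heads owned by this TP rank.
    assert n_heads % tp_world_size == 0
    heads_per_rank = n_heads // tp_world_size
    start = tp_rank * heads_per_rank
    slopes = alibi_slopes(n_heads)[start:start + heads_per_rank]
    # Relative-position term, broadcastable to (local_heads, seq_len, seq_len).
    distances = torch.arange(1 - seq_len, 1, dtype=torch.float32)
    return slopes.view(-1, 1, 1) * distances.view(1, 1, seq_len)
```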