
[Feature Suggestion] Tensor Parallelism for Accelerating LLM

Open zhengpeirong opened this issue 10 months ago • 22 comments

Dear Author,

Your contribution is critical for the open-source community. The distributed-llama repo implements tensor parallelism from scratch, and the results are impressive. However, there is still room for improvement. Since my coding ability is not good enough to make these improvements myself, I hope you can take a look at my suggestions below.

Challenge: the root node's special tasks and synchronization overhead

When I run version '0.1.0' of the repo, I find that the softmax operations in MultiHead are executed on the root node only, and this step accounts for a significant portion of the total time. Second, the synFfnA and synFfn2 functions also take a lot of time.
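To illustrate the second point, here is a minimal C++ sketch (the names are made up for illustration, not the repo's actual code) of an FFN where each layer's output is sliced across workers: the full vector must be reassembled after every matmul, which gives two synchronization points per FFN.

```cpp
// Hypothetical sketch: an FFN whose two matmul outputs are both sliced across
// workers, so every worker must synchronize after each matmul (two sync points).
// syncAcrossWorkers() is a placeholder, not a distributed-llama API.
#include <vector>
#include <cstddef>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[row][col]

// Placeholder for a network all-gather; in a real setup this is where the
// transfer cost (synFfnA / synFfn2 in the measurement above) shows up.
void syncAcrossWorkers(Vec& partial) { (void)partial; /* exchange slices */ }

Vec matmul(const Mat& w, const Vec& x) {
    Vec y(w.size(), 0.0f);
    for (size_t r = 0; r < w.size(); ++r)
        for (size_t c = 0; c < w[r].size(); ++c)
            y[r] += w[r][c] * x[c];
    return y;
}

Vec ffnTwoSyncs(const Mat& w1Slice, const Mat& w2Slice, const Vec& x) {
    Vec h = matmul(w1Slice, x); // each worker produces a slice of the hidden vector
    syncAcrossWorkers(h);       // sync #1: assemble the full hidden vector
    Vec y = matmul(w2Slice, h); // each worker produces a slice of the output
    syncAcrossWorkers(y);       // sync #2: assemble the full output
    return y;
}
```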

Mature solutions

In fact, these challenges have already been addressed in this paper: https://arxiv.org/abs/1909.08053 (Megatron-LM). Its solution is shown in the image:

[Image: Megatron-LM tensor-parallel partitioning of the attention and MLP blocks]

First, it runs the attention mechanism (including the softmax) on every worker. Second, two consecutive weight matrices are split along the column dimension and the row dimension respectively, which reduces the number of synchronization operations from two to one.
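As a rough illustration of that idea, here is a minimal C++ sketch (hypothetical names, simplified to a two-matrix FFN with SiLU; llama's gate/up/down FFN follows the same pattern). Both matrices are sliced along the intermediate dimension, which is the column split of the first matrix and the row split of the second in the paper's notation, so each worker runs both matmuls locally and only one all-reduce is needed.

```cpp
// Hypothetical sketch of the Megatron-LM split (not the repo's actual API):
// W1 and W2 are both sliced along the intermediate dimension, so the
// element-wise activation and both matmuls stay local, and a single
// all-reduce (sum) at the end produces the full output.
#include <vector>
#include <cstddef>
#include <cmath>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[row][col]

Vec matmul(const Mat& w, const Vec& x) {
    Vec y(w.size(), 0.0f);
    for (size_t r = 0; r < w.size(); ++r)
        for (size_t c = 0; c < w[r].size(); ++c)
            y[r] += w[r][c] * x[c];
    return y;
}

// Placeholder for the single all-reduce over the network.
void allReduceSum(Vec& partialOutput) { (void)partialOutput; /* sum partials */ }

// w1Slice: this worker's rows of W1 (a slice of the intermediate dimension),
//          so matmul(w1Slice, x) yields this worker's slice of the hidden vector.
// w2Slice: the matching columns of W2, so matmul(w2Slice, h) yields a partial
//          output of full size; summing the partials gives the exact result.
Vec ffnOneSync(const Mat& w1Slice, const Mat& w2Slice, const Vec& x) {
    Vec h = matmul(w1Slice, x);                        // local slice of hidden activations
    for (float& v : h) v = v / (1.0f + std::exp(-v));  // SiLU, element-wise and local
    Vec y = matmul(w2Slice, h);                        // partial full-size output
    allReduceSum(y);                                   // the only synchronization point
    return y;
}
```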

If you are willing to make further improvements to the repo, the following tutorial describes a mature solution for every component of llama2 using tensor parallelism and sequence parallelism: https://pytorch.org/tutorials/intermediate/TP_tutorial.html. However, it is implemented in Python, so you would be the first to implement the solution in C++.
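For the first part of the challenge (softmax running only on the root node), the same idea applies to attention: if each worker owns a subset of heads, the softmax runs locally everywhere. Below is a hedged C++ sketch for a single query position, again with made-up names rather than the repo's actual data structures; each worker would call it only for its own heads and then apply its slice of the output projection, followed by a single all-reduce, just like the FFN above.

```cpp
// Hypothetical sketch of head-parallel attention: each worker computes
// scores, softmax, and the weighted sum for its own heads locally.
#include <vector>
#include <cstddef>
#include <cmath>

using Vec = std::vector<float>;

void softmaxInPlace(Vec& s) {
    float m = s[0];
    for (float v : s) if (v > m) m = v;
    float sum = 0.0f;
    for (float& v : s) { v = std::exp(v - m); sum += v; }
    for (float& v : s) v /= sum;
}

// One head's attention for a single query over seqLen cached key/value vectors.
// q, keys[t], and values[t] all have length headDim.
Vec attendOneHead(const Vec& q, const std::vector<Vec>& keys,
                  const std::vector<Vec>& values, size_t headDim) {
    size_t seqLen = keys.size();
    Vec scores(seqLen, 0.0f);
    for (size_t t = 0; t < seqLen; ++t) {
        for (size_t d = 0; d < headDim; ++d) scores[t] += q[d] * keys[t][d];
        scores[t] /= std::sqrt(static_cast<float>(headDim));
    }
    softmaxInPlace(scores);            // softmax stays local to this worker
    Vec out(headDim, 0.0f);
    for (size_t t = 0; t < seqLen; ++t)
        for (size_t d = 0; d < headDim; ++d) out[d] += scores[t] * values[t][d];
    return out;
}
```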

Thanks for your contribution!!! Best Regards

zhengpeirong · Apr 26 '24 14:04