distributed-llama
[Feature Suggestion] Tensor Parallelism for Accelerating LLMs
Dear Author,
Your contribution is very valuable to the open-source community. The distributed-llama repo implements tensor parallelism from scratch, and the results are impressive. However, there is still room for improvement. Since my own coding ability is not good enough to make these improvements myself, I hope you can take a look at my suggestions below.
Challenge: the root node's special tasks and the synchronization cost
When I run version 0.1.0 of the repo, I find that the softmax operations in MultiHead are performed on the root node only, and they cost a significant portion of the total time. Second, the synFfnA and synFfn2 functions also take a lot of time.
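For illustration only (this is not the repo's actual code, and it is just my guess at where the two synchronizations come from): if both FFN matrices are sliced along their output columns, the full hidden vector has to be gathered before the second matmul, and the output slices have to be gathered again afterwards. A minimal single-process C++ sketch of that two-sync pattern, using a plain two-matrix FFN for simplicity:

```cpp
// Illustration only, NOT distributed-llama's actual code: if BOTH FFN
// matrices are sliced along their output columns, each FFN block needs
// TWO synchronizations (gather the hidden state, then gather the output).
#include <cstdio>
#include <vector>
#include <cmath>

static float gelu(float a) {
    return 0.5f * a * (1.0f + std::tanh(0.79788456f * (a + 0.044715f * a * a * a)));
}

int main() {
    const int d = 4, h = 8, nWorkers = 2;
    const int hLocal = h / nWorkers, dLocal = d / nWorkers;

    // Toy input x (1 x d) and weights W1 (d x h), W2 (h x d), row-major.
    std::vector<float> x(d), W1(d * h), W2(h * d);
    for (int i = 0; i < d; i++) x[i] = 0.1f * (i + 1);
    for (int i = 0; i < d * h; i++) W1[i] = 0.01f * (i % 7);
    for (int i = 0; i < h * d; i++) W2[i] = 0.02f * (i % 5);

    // Sync #1: every worker computes its column slice of hidden = act(x * W1),
    // and all slices must be exchanged so everyone has the full hidden vector.
    std::vector<float> hidden(h, 0.0f);
    for (int w = 0; w < nWorkers; w++) {
        for (int j = 0; j < hLocal; j++) {
            int col = w * hLocal + j;
            float a = 0.0f;
            for (int i = 0; i < d; i++) a += x[i] * W1[i * h + col];
            hidden[col] = gelu(a);
        }
        // <-- in a real cluster: send/gather this worker's hidden slice here
    }

    // Sync #2: every worker computes its column slice of the output, and the
    // slices must be gathered again to assemble the full (1 x d) result.
    std::vector<float> out(d, 0.0f);
    for (int w = 0; w < nWorkers; w++) {
        for (int k = w * dLocal; k < (w + 1) * dLocal; k++)
            for (int j = 0; j < h; j++) out[k] += hidden[j] * W2[j * d + k];
        // <-- in a real cluster: send/gather this worker's output slice here
    }

    for (int k = 0; k < d; k++) printf("out[%d] = %f\n", k, out[k]);
    return 0;
}
```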
Mature solutions
In fact, these challenges have already been addressed in the Megatron-LM paper: https://arxiv.org/abs/1909.08053. Its solution works as follows.
First, it performs the attention mechanism (including the softmax) on every worker rather than on one node. Second, it splits two consecutive weight matrices along columns and then along rows, which reduces the cost to one synchronization operation per block instead of two.
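To make the column-then-row scheme concrete, here is a minimal single-process C++ sketch of the Megatron-style FFN sharding. It is illustrative only, not distributed-llama code, and it uses a plain two-matrix GELU FFN as in the paper (llama2's SwiGLU FFN has three matrices, but the communication pattern is analogous). Each worker owns some columns of W1 and the matching rows of W2, so the only synchronization per FFN block is a sum of the partial outputs, i.e. one all-reduce:

```cpp
// Single-process sketch of the Megatron-LM MLP sharding: column-split W1,
// row-split W2. Illustrative only, not distributed-llama's code.
#include <cstdio>
#include <vector>
#include <cmath>

static float gelu(float a) {
    return 0.5f * a * (1.0f + std::tanh(0.79788456f * (a + 0.044715f * a * a * a)));
}

int main() {
    const int d = 4, h = 8, nWorkers = 2, hLocal = h / nWorkers;

    // Toy input x (1 x d) and weights W1 (d x h), W2 (h x d), row-major.
    std::vector<float> x(d), W1(d * h), W2(h * d);
    for (int i = 0; i < d; i++) x[i] = 0.1f * (i + 1);
    for (int i = 0; i < d * h; i++) W1[i] = 0.01f * (i % 7);
    for (int i = 0; i < h * d; i++) W2[i] = 0.02f * (i % 5);

    // Each worker owns hLocal columns of W1 and the matching hLocal rows of W2,
    // so it can compute act(x * W1_local) * W2_local entirely locally. The
    // result is a PARTIAL (1 x d) output that only has to be summed once.
    std::vector<float> out(d, 0.0f);
    for (int w = 0; w < nWorkers; w++) {
        std::vector<float> partial(d, 0.0f);
        for (int j = 0; j < hLocal; j++) {
            int col = w * hLocal + j;      // global index into the hidden dim
            float a = 0.0f;
            for (int i = 0; i < d; i++) a += x[i] * W1[i * h + col];
            float g = gelu(a);             // activation is element-wise, no sync needed
            for (int k = 0; k < d; k++) partial[k] += g * W2[col * d + k];
        }
        // The ONLY synchronization per FFN block: sum the partial outputs.
        // In a real cluster this loop body would be a sum all-reduce.
        for (int k = 0; k < d; k++) out[k] += partial[k];
    }

    for (int k = 0; k < d; k++) printf("out[%d] = %f\n", k, out[k]);
    return 0;
}
```

Because the activation is applied element-wise to the locally owned columns, no communication is needed between the two matmuls.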
If you are willing to make further improvements to the repo, the following tutorial describes a mature solution for every component of llama2 using tensor parallelism and sequence parallelism:
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
However, it is implemented in Python, so you would be the first one to implement the solution in C++.
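As a rough starting point for a C++ version, the attention block could be partitioned by heads in the same spirit. The sketch below is purely hypothetical (toy sizes, a single query token, a fixed number of cached positions, no real networking); it only shows that each worker can compute the softmax for its own heads locally, and that one sum all-reduce after the row-split output projection is enough:

```cpp
// Purely hypothetical sketch (not distributed-llama code): head-parallel
// attention for a single query token over T cached positions. Each worker
// owns nHeads / nWorkers heads and runs its softmax locally; only the
// row-split output projection needs one sum all-reduce at the end.
#include <cstdio>
#include <vector>
#include <cmath>

int main() {
    const int nHeads = 4, headDim = 8, T = 5;              // toy sizes
    const int d = nHeads * headDim;
    const int nWorkers = 2, headsPerWorker = nHeads / nWorkers;

    // Toy query q (1 x d), cached keys/values K, V (T x d), output proj Wo (d x d).
    std::vector<float> q(d), K(T * d), V(T * d), Wo(d * d);
    for (int i = 0; i < d; i++) q[i] = 0.01f * (i % 9);
    for (int i = 0; i < T * d; i++) { K[i] = 0.02f * (i % 7); V[i] = 0.03f * (i % 5); }
    for (int i = 0; i < d * d; i++) Wo[i] = 0.001f * (i % 11);

    std::vector<float> out(d, 0.0f);
    for (int w = 0; w < nWorkers; w++) {
        std::vector<float> partial(d, 0.0f);
        for (int h = 0; h < headsPerWorker; h++) {
            int off = (w * headsPerWorker + h) * headDim;  // this head's slice of d

            // Attention scores and a numerically stable local softmax.
            std::vector<float> score(T);
            float maxScore = -1e30f;
            for (int t = 0; t < T; t++) {
                float s = 0.0f;
                for (int i = 0; i < headDim; i++) s += q[off + i] * K[t * d + off + i];
                score[t] = s / std::sqrt((float) headDim);
                if (score[t] > maxScore) maxScore = score[t];
            }
            float sum = 0.0f;
            for (int t = 0; t < T; t++) { score[t] = std::exp(score[t] - maxScore); sum += score[t]; }

            // Weighted sum of values for this head, then this head's rows of
            // the row-split Wo -> a partial (1 x d) contribution.
            for (int i = 0; i < headDim; i++) {
                float v = 0.0f;
                for (int t = 0; t < T; t++) v += (score[t] / sum) * V[t * d + off + i];
                for (int k = 0; k < d; k++) partial[k] += v * Wo[(off + i) * d + k];
            }
        }
        // The ONLY synchronization: sum the partial outputs across workers
        // (a sum all-reduce in a real cluster).
        for (int k = 0; k < d; k++) out[k] += partial[k];
    }

    printf("out[0] = %f  out[%d] = %f\n", out[0], d - 1, out[d - 1]);
    return 0;
}
```

With this scheme, the softmax no longer has to run only on the root node.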
Thanks for your contribution!
Best regards