
Feature Request: Tensor Parallelism support

Open · ClarkChin08 opened this issue 6 months ago · 3 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Tensor parallelism is a critical technique for training and running inference on very large language models by splitting the computations/tensors across multiple compute devices.

Motivation

In our previous implementation on Xeon CPUs, tensor parallelism (TP) significantly reduced inference latency.

| model | precision | TP size | input size | next token time (ms) |
| --- | --- | --- | --- | --- |
| llama2-70b | q4_j | 1 | 32 | 191.91 |
| llama2-70b | q4_j | 2 | 32 | 120.87 |
| llama2-70b | q4_j | 4 | 32 | 86.15 |
| llama2-70b | q4_j | 1 | 1024 | 197.18 |
| llama2-70b | q4_j | 2 | 1024 | 129.25 |
| llama2-70b | q4_j | 4 | 1024 | 91.76 |
| llama2-70b | q4_j | 1 | 2012 | 204.85 |
| llama2-70b | q4_j | 2 | 2012 | 127.31 |
| llama2-70b | q4_j | 4 | 2012 | 100.44 |

Note: TP size = 1 means TP is not used.

Possible Implementation

In our TP implementation, we pre-split the corresponding weights, so this cost is paid once at load time and does not affect inference performance. The other major factor impacting performance is the 'all reduce': since each node computes only partial results, an 'all reduce' must be performed on the output data, and this operation is relatively time-consuming. Interestingly, with a well-chosen splitting and combining scheme, many primitives can be computed independently across nodes, which is very helpful for performance. A rational splitting method is therefore extremely important.
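To illustrate why the split direction matters, here is a minimal NumPy sketch (an illustration only, not llama.cpp code; the shapes, node count, and variable names are assumptions). Splitting a weight by columns lets each node produce an independent slice of the output, while splitting by rows produces partial full-size outputs that must be summed, which is exactly what an all-reduce does across nodes.

```python
import numpy as np

np.random.seed(0)
n_nodes = 2
x = np.random.randn(1, 8)        # activation, replicated on every node
W = np.random.randn(8, 6)        # full weight matrix

# Column split: each node holds a slice of W's columns.
# Outputs are disjoint slices -> no reduction needed, only a concat (or keep local).
W_cols = np.split(W, n_nodes, axis=1)
y_col = np.concatenate([x @ w for w in W_cols], axis=1)
assert np.allclose(y_col, x @ W)

# Row split: each node holds a slice of W's rows and the matching slice of x.
# Each node produces a *partial* full-size output -> a summing all-reduce is required.
W_rows = np.split(W, n_nodes, axis=0)
x_parts = np.split(x, n_nodes, axis=1)
partials = [xp @ wr for xp, wr in zip(x_parts, W_rows)]
y_row = np.sum(partials, axis=0)   # this sum stands in for the all-reduce
assert np.allclose(y_row, x @ W)
```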

Taking the FFN module as an example: if the first matmul is split by columns, multiplying it with the input yields two independent sub-matrices, one per node. If the second matmul is then split by rows, it can consume those sub-matrices directly without an intermediate 'all reduce'. The entire FFN module therefore needs only one 'all reduce'; with a properly tailored split, even a chain of several matmuls may need only a single 'all reduce'. The element-wise operations between the matmuls can be ignored here, since applying them per slice does not change the result.

[figure: FFN split — first matmul by columns, second by rows, single all-reduce at the end]

The attention module is more complex. As shown in the following figure, a rational split makes it so that the entire attention module also requires only one 'all reduce', greatly reducing synchronization time.

[figure: attention module split with a single all-reduce]
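To make the FFN example above concrete, here is a minimal NumPy sketch under assumed shapes and an assumed SiLU activation (not the author's code): the first matmul is split by columns, the second by the matching rows, and the only communication is a single summing all-reduce at the very end.

```python
import numpy as np

np.random.seed(0)
n_nodes = 2
d_model, d_ff = 8, 16

x  = np.random.randn(1, d_model)       # input activation, replicated on every node
W1 = np.random.randn(d_model, d_ff)    # first FFN matmul (up projection)
W2 = np.random.randn(d_ff, d_model)    # second FFN matmul (down projection)

silu = lambda v: v / (1.0 + np.exp(-v))   # element-wise, so it can be applied per slice

# Reference: no tensor parallelism.
ref = silu(x @ W1) @ W2

# TP: W1 split by columns, W2 split by the matching rows.
W1_cols = np.split(W1, n_nodes, axis=1)
W2_rows = np.split(W2, n_nodes, axis=0)

# Each node works only on its own slice; no communication inside the FFN.
partials = [silu(x @ w1) @ w2 for w1, w2 in zip(W1_cols, W2_rows)]

# The single all-reduce: sum the partial outputs across nodes.
out = np.sum(partials, axis=0)
assert np.allclose(out, ref)
```

The attention module typically follows the same pattern: the Q/K/V projections are split by columns (i.e. by heads) so each node attends over its own heads independently, and the output projection is split by rows, again leaving a single all-reduce at the end of the block.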

ClarkChin08 · Aug 19 '24