Peirong Zheng
What a pity.
Great! I want to share the [news](https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV) and [video](https://www.youtube.com/watch?v=gisEXzue5xI) showing that Vulkan GPU hardware support is now officially available on Raspberry Pi OS.
As a supplement, the figure shows the detailed time cost of each task when 4 Raspberry Pis run Llama2-7B-Q40. As you can see, how much time the aforementioned function...
@b4rtaz The `qkv` has been reverted. Do you plan to deal with this issue? Not only does `MulHead` cost time; `Finalize` also accounts for a large portion of it.
@b4rtaz Thanks for your persistence and effort. 1. The `qkv` can be optimized; all you need to read is section "3. Model Parallel Transformers" of the paper "Megatron-LM: Training...
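To illustrate the Megatron-LM idea for the attention block (a minimal numpy sketch with illustrative names, not distributed-llama's actual code): each device holds a subset of heads, i.e. a column slice of `Wq`/`Wk`/`Wv` and the matching row slice of `Wo`, so the whole block needs only one all-reduce at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_dev = 8, 4, 2
d_head = d_model // n_heads
x = rng.standard_normal((3, d_model))          # 3 tokens
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn_heads(x, heads):
    """Attention restricted to the given head indices (illustrative helper)."""
    outs = []
    for h in heads:
        s = slice(h * d_head, (h + 1) * d_head)
        q, k, v = x @ Wq[:, s], x @ Wk[:, s], x @ Wv[:, s]
        outs.append(softmax(q @ k.T / np.sqrt(d_head)) @ v)
    return np.concatenate(outs, axis=-1)

# Reference: all heads computed on one device.
ref = attn_heads(x, range(n_heads)) @ Wo

# Tensor parallel: each device computes its own heads and multiplies by its
# row slice of Wo; summing the partials (the all-reduce) recovers `ref`.
partials = []
for dev in range(n_dev):
    heads = range(dev * n_heads // n_dev, (dev + 1) * n_heads // n_dev)
    rows = slice(heads.start * d_head, heads.stop * d_head)
    partials.append(attn_heads(x, heads) @ Wo[rows, :])
out = sum(partials)                            # one all-reduce per block

assert np.allclose(out, ref)
```

Because the per-head attention outputs never mix until `Wo`, no intermediate activation has to be transferred between devices.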
The optimized result will be only **72%** of the original generation time! That is a **1.39x speedup** (1/0.72 ≈ 1.39) over this version. I have roughly computed the optimized result. Specifically, the main transfer time...
> @zhengpeirong this is just a guess, have you proved that by any implementation? > > Currently I [noticed](https://github.com/b4rtaz/distributed-llama/pull/38) a problem with the rope layer, it's not easy to split...
@b4rtaz 🎉 You have completed state-of-the-art tensor parallelism for the Attention layer!!! Moreover, continuing our earlier discussion, there are still two optimizations that can be done: 1. Computation: The last layer (`Finalize`)...
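If `Finalize` includes the final projection to the vocabulary, it can be parallelized the same way. A minimal sketch (assuming, purely for illustration, that the step is a plain hidden-state-to-logits matmul): column-split the output weights so each device computes logits for its own vocab shard, and only the shards need to be gathered.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab, n_dev = 8, 12, 3
h = rng.standard_normal(d_model)            # final hidden state of one token
W = rng.standard_normal((d_model, vocab))   # output-projection weights

ref = h @ W                                 # full logits on one device

shards = np.split(W, n_dev, axis=1)         # column split: vocab/n_dev each
parts = [h @ s for s in shards]             # computed in parallel per device
logits = np.concatenate(parts)              # all-gather of the logit shards

assert np.allclose(logits, ref)
```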
https://github.com/huggingface/transformers/blob/bb48e921868ac750417956de941606f7e2fa02ca/src/transformers/models/llama/modeling_llama.py#L199-L219 @b4rtaz Just for your reference, this code implements the FFN layer of Llama with Tensor Parallel acceleration. In summary, the only 2 dimensions Tensor Parallel divides for the Attention...
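The FFN split that code performs can be sketched in a few lines of numpy (illustrative names, not the transformers API): the gate and up projections are column-split, the down projection is row-split, so Llama's SwiGLU MLP also needs only one all-reduce per layer.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff, n_dev = 8, 16, 2
x = rng.standard_normal((3, d_model))
W_gate = rng.standard_normal((d_model, d_ff))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))

silu = lambda a: a / (1.0 + np.exp(-a))

# Reference: Llama's MLP computed on a single device.
ref = (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Tensor parallel: column-split gate/up, row-split down, sum the partials.
# The element-wise SiLU and product act per column, so they commute with the
# column split and need no communication.
cols = np.array_split(np.arange(d_ff), n_dev)
partials = [
    (silu(x @ W_gate[:, c]) * (x @ W_up[:, c])) @ W_down[c, :]
    for c in cols
]
out = sum(partials)                          # the single all-reduce

assert np.allclose(out, ref)
```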
> @zhengpeirong it seems after I adjusted mlp layers to your suggestion the transfer has dropped by ~40% per token. 🤯 > > | Devices | 0.5.0 | PR |...