MFA Integration
In collaboration with @liuliu, I have been developing a new kernel library for GEMM and attention operations. Stable Diffusion/NNC is the primary use case, but I hope to integrate into LLaMA/GGML too. The library consistently outperforms MPS by a large amount: https://twitter.com/philipturnerar/status/1669146393271730178
Here is out-of-the-box performance, with zero fine-tuning. The table shows matrix sizes common in SD and LLaMA. For example, in the second GEMM of $\mathrm{softmax}(QK^T)\,V$ for LLaMA-13B, MPS achieves 5.8% utilization while MFA achieves 40% utilization, roughly 6 times faster. That is not a percent improvement; it is an order-of-magnitude improvement.
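To spell out that arithmetic (the 5.8% and 40% figures come from the 2048 × 40 × 2048 row in the table below): both kernels perform the same $2MNK$ FLOPs on the same device, so the ratio of utilizations at a fixed problem size is the speedup:

$$
\text{speedup} \approx \frac{\text{MFA utilization}}{\text{MPS utilization}} = \frac{40\%}{5.8\%} \approx 6.9\times
$$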
M | N | K | F16 win | MFA (F16) | MPS (F16) | F32 win | MFA (F32) | MPS (F32) |
---|---|---|---|---|---|---|---|---|
1280 | 4096 | 320 | ✅ | 83% | 68% | ✅ | 75% | 75% |
1024 | 2560 | 640 | ✅ | 82% | 69% | ✅ | 76% | 75% |
4096 | 4096 | 40 | ✅ | 62% | 35% | ✅ | 50% | 40% |
4096 | 40 | 4096 | ✅ | 50% | 11% | ✅ | 36% | 11% |
1024 | 1024 | 80 | ✅ | 54% | 42% | ✅ | 48% | 48% |
1024 | 80 | 1024 | ✅ | 43% | 17% | ✅ | 40% | 14% |
4096 | 320 | 320 | ✅ | 79% | 62% | ✅ | 70% | 68% |
4096 | 1713 | 40 | ✅ | 52% | 32% | ✅ | 40% | 34% |
4096 | 40 | 1713 | ✅ | 46% | 19% | ✅ | 40% | 9.7% |
4096 | 92 | 40 | ✅ | 28% | 7.2% | ✅ | 21% | 6.4% |
4096 | 40 | 92 | ✅ | 27% | 7.5% | ✅ | 19% | 7.3% |
1805 | 320 | 768 | ✅ | 75% | 51% | ✅ | 63% | 56% |
1805 | 1280 | 768 | ✅ | 81% | 64% | ❌ | 67% | 71% |
512 | 512 | 32 | ✅ | 26% | 14% | ✅ | 20% | 14% |
512 | 32 | 512 | ✅ | 8.2% | 8.1% | ❌ | 7.5% | 7.7% |
2048 | 2048 | 32 | ✅ | 61% | 40% | ✅ | 50% | 46% |
2048 | 32 | 2048 | ✅ | 35% | 35% | ✅ | 35% | 32% |
2048 | 2048 | 40 | ✅ | 56% | 32% | ✅ | 46% | 36% |
2048 | 40 | 2048 | ✅ | 40% | 5.8% | ✅ | 37% | 5.9% |
2048 | 2048 | 52 | ✅ | 52% | 32% | ✅ | 39% | 35% |
2048 | 52 | 2048 | ✅ | 49% | 14% | ✅ | 47% | 12% |
Compute utilization (higher is better). A ✅ means MFA is much faster than MPS; an ❌ means MFA is slower or about the same.
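For reference, the thread does not define "utilization" explicitly, but assuming the usual convention it is the achieved GEMM throughput divided by the device's peak throughput at that precision:

$$
\text{utilization} = \frac{2\,M N K}{t_{\text{kernel}} \cdot \text{FLOPS}_{\text{peak}}}
$$

where $t_{\text{kernel}}$ is the measured kernel time and $2MNK$ is the FLOP count of one GEMM.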
I will open source Metal FlashAttention, but it's not in a presentable state just yet. I am opening this thread to discuss anything relevant to integration, such as existing bottlenecks, simulation results, dependencies, etc.
cc: @ggerganov
In general, `ggml` currently lacks an efficient GEMM implementation, and therefore we currently rely on third-party BLAS packages, which is obviously not desirable. Adding GEMM to `ggml` will be the next big addition to the framework - both for CPU and GPU.
I am still in the process of formulating some requirements for what a potential implementation should satisfy. And I fully understand that it will be close to impossible to achieve the maximum performance available from dedicated libraries (such as cuBLAS, for example). We will definitely aim for simplicity of the implementation and quantization support.
Currently, `llama.cpp` does not use MPS matrix multiplication as it is too inefficient. Instead, we fall back to the AMX CBLAS interface, which is also not the best since it lacks quantization support, but it is much better than naive dot-product based matrix multiplication.
We have to take a look at the MFA that you propose, but I think the main "features" of the implementation will be its simplicity and the option to extend for quantized matrices. Third-party dependencies are highly undesired. Performance for `perplexity` computation will be important.
I think ideally, a GEMM implementation should make use of the existing dot-product kernels (both on the CPU and GPU) in order to keep the code size manageable, even if this means sacrificing a bit of performance.
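As a rough illustration of that structure (a hypothetical sketch only - `ggml` is written in C and this is not its API; `dot` and `gemmViaDotProducts` are made-up names), a GEMM can be expressed entirely in terms of an existing dot-product routine:

```swift
// Hypothetical sketch: a GEMM whose inner loop is just a reusable
// dot-product routine, trading peak performance for less kernel code.
func dot(_ a: ArraySlice<Float>, _ b: ArraySlice<Float>) -> Float {
    // Stand-in for an existing, already-optimized dot-product kernel.
    return zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

/// C = A * B, with A stored row-major (M x K) and B stored as its transpose
/// (N x K, row-major), so every output element is a dot product of two
/// contiguous rows.
func gemmViaDotProducts(a: [Float], bT: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        let aRow = a[(i * k)..<((i + 1) * k)]
        for j in 0..<n {
            let bRow = bT[(j * k)..<((j + 1) * k)]
            c[i * n + j] = dot(aRow, bRow)
        }
    }
    return c
}
```

The obvious cost is that this forgoes register- and threadgroup-level blocking, which is exactly the performance trade-off acknowledged above.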
Will be looking into formulating the above into a well-defined issue / project.
The biggest concern is how MFA is compiled. It needs to use the SIMD futures instructions, which have a weird Xcode 14.2 dependency (more info here). The gist of my build system (sketched below):
- Swift script with a colorful command-line interface
- Locates `xcrun metal` from Xcode 14.2 and `xcrun metallib` from a different Xcode
- Packages the source code alongside the Metal binary
- Designed so the typical user will download the `metallib` from GitHub releases, instead of compiling it
*Build script interface*
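As a rough illustration of the compile step the script automates (a hypothetical sketch, not the actual MFA build script - the Xcode paths, file names, and the `run` helper are assumptions):

```swift
import Foundation

// Hypothetical sketch of the two-toolchain compile step. It selects the
// Metal compiler from Xcode 14.2 (required for the SIMD futures / async
// copy instructions) via DEVELOPER_DIR, compiles the shader source to .air,
// then links it into a single .metallib.
func run(_ tool: String, _ args: [String], developerDir: String) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/xcrun")
    process.arguments = [tool] + args
    var env = ProcessInfo.processInfo.environment
    env["DEVELOPER_DIR"] = developerDir   // picks which Xcode's toolchain xcrun uses
    process.environment = env
    try process.run()
    process.waitUntilExit()
}

let xcode142 = "/Applications/Xcode_14.2.app/Contents/Developer"   // assumed path
let otherXcode = "/Applications/Xcode.app/Contents/Developer"      // assumed path

// Compile with the Xcode 14.2 front end...
try run("metal", ["-c", "GEMM.metal", "-o", "GEMM.air"], developerDir: xcode142)
// ...then link with metallib from a different Xcode, as described above.
try run("metallib", ["GEMM.air", "-o", "libMetalFlashAttention.metallib"],
        developerDir: otherXcode)
```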
> Third-party dependencies are highly undesired.
There's a dependency on AppleGPUInfo, which @liuliu is cleaning up (removing the DeviceKit dependency).
> but I think the main "features" of the implementation will be its simplicity
I designed the repo as a "reference implementation" of modern algorithms. It's as simple as possible, so you can fork it or modify it for your own needs. This is an area where MPS struggled. For example, I wanted to take full control over the GPU command encoding process, because MPS has high sequential overhead. I couldn't get Apple to support more flexible encoding in MPS.
> option to extend for quantized matrices.
I think the Q, K, and V during attention are all in FP16. While I do hope to support quantized GEMV, it's not a priority, since you wrote a pretty efficient kernel yourself. The biggest issue is either combining FlashAttention with a K-splitting algorithm or falling back to standard attention for very small sequence lengths.
https://twitter.com/philipturnerar/status/1672104485387214851
I'll discuss what I was thinking about - triangular FlashAttention. Basically, you compile a finite number of Metal compute pipeline states, one for each modulo of the sequence length. You create an implicit triangular mask by skipping half of the computations. I don't know whether reordering the operations into a rectangle will increase performance.
I'll explain the algorithm better as I decide which parts of my old code base to open-source.
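A host-side sketch of the block-skipping idea, assuming a causal mask and block-tiled attention; the names and the diagonal-block handling are illustrative, not MFA's actual kernel structure:

```swift
// Illustrative sketch of "triangular FlashAttention" block skipping, not
// MFA's actual kernel. With the attention matrix tiled into blocks, every
// block strictly above the diagonal is fully masked, so roughly half of
// the work can be skipped by bounding the inner loop instead of masking
// element-wise.
let sequenceLength = 2048
let blockSize = 64
let numBlocks = (sequenceLength + blockSize - 1) / blockSize

func processBlock(query qBlock: Int, keyValue kvBlock: Int, masked: Bool) {
    // Placeholder for one FlashAttention inner iteration over a
    // (blockSize x blockSize) tile of softmax(QK^T) V.
}

for qBlock in 0..<numBlocks {
    // Key/value blocks with index > qBlock lie entirely above the causal
    // diagonal and are never loaded or computed.
    for kvBlock in 0...qBlock {
        // Only the diagonal block needs an explicit triangular mask;
        // blocks below the diagonal are unmasked.
        processBlock(query: qBlock, keyValue: kvBlock, masked: kvBlock == qBlock)
    }
}
```

Whether it pays off to further reorder the skipped work into a dense rectangle is exactly the open question raised above.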
> I'll discuss what I was thinking about - triangular FlashAttention.
TFA, even the abbreviation looks cool
@ggerganov MFA is finally in a state where we can start integration tests. I will eventually post pre-compiled binaries, but for the moment, you'll need to download Xcode 14.2 from the Apple developer archive to compile it. Ideally, you would compile and host your own Metal binaries for llama.cpp users, or create a custom metallib writer that emits the SIMD async copy instructions (similar to Julia's LLVM fork for Metal).
The biggest benefit is speeding up matrix-matrix multiplication and reducing the need to close the `MTLComputeCommandEncoder` whenever something must be executed on MPS. Once MFA supports FlashAttention, you can remove the MPSSoftmax dependency too. However, this would only work on M1 Macs (Intel Macs would need a fallback library, maybe CLBlast).
For Float16, MFA is consistently 2x faster than MPS. MPS performance seems to have regressed a bit on this macOS 14 beta, or maybe it's a terrible driver latency bottleneck with how I encoded the MPS commands. Either way, this indicates a real-world speedup from greater flexibility in how you encode commands.
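To illustrate the encoding flexibility being referred to, here is a minimal Swift sketch (not llama.cpp's or MFA's actual integration code; `GEMMDispatch` and the buffer layout are made up) of batching several custom GEMM dispatches on a single `MTLComputeCommandEncoder`. MPS kernels encode onto the command buffer themselves, so any open encoder has to be ended before each MPS call.

```swift
import Metal

// Minimal sketch: many custom GEMM dispatches share one compute encoder.
struct GEMMDispatch {
    var pipeline: MTLComputePipelineState   // e.g. an MFA GEMM pipeline
    var a, b, c: MTLBuffer                  // illustrative buffer layout
    var threadgroups: MTLSize
    var threadsPerThreadgroup: MTLSize
}

func encode(_ dispatches: [GEMMDispatch], on commandBuffer: MTLCommandBuffer) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else { return }
    for gemm in dispatches {
        encoder.setComputePipelineState(gemm.pipeline)
        encoder.setBuffer(gemm.a, offset: 0, index: 0)
        encoder.setBuffer(gemm.b, offset: 0, index: 1)
        encoder.setBuffer(gemm.c, offset: 0, index: 2)
        encoder.dispatchThreadgroups(gemm.threadgroups,
                                     threadsPerThreadgroup: gemm.threadsPerThreadgroup)
    }
    // One endEncoding() for the whole batch; with MPS in the loop, the
    // encoder would have to be closed and reopened around every MPS call.
    encoder.endEncoding()
}
```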