MFA Integration
In collaboration with @liuliu, I have been developing a new kernel library for GEMM and attention operations. Stable Diffusion/NNC is the primary use case, but I hope to integrate into LLaMA/GGML too. The library consistently outperforms MPS by a large amount: https://twitter.com/philipturnerar/status/1669146393271730178
Here is out-of-the-box performance, with zero fine-tuning. The table shows matrix sizes common in SD and LLaMA. For example, in the second GEMM of $\mathrm{softmax}(QK^T)\,V$ for LLaMA-13B, MPS achieves 5.8% utilization while MFA achieves 40% utilization, roughly 6 times faster. That is not a percent improvement; it is an order-of-magnitude improvement.
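To spell out that arithmetic (the 5.8% and 40% figures come from the 2048 × 40 × 2048 row in the table below): both kernels perform the same $2MNK$ FLOPs on the same device, so the ratio of utilizations at a fixed problem size is the speedup:

$$
\text{speedup} \approx \frac{\text{MFA utilization}}{\text{MPS utilization}} = \frac{40\%}{5.8\%} \approx 6.9\times
$$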
M | N | K | F16 win | MFA (F16) | MPS (F16) | F32 win | MFA (F32) | MPS (F32) |
---|---|---|---|---|---|---|---|---|
1280 | 4096 | 320 | ✅ | 83% | 68% | ✅ | 75% | 75% |
1024 | 2560 | 640 | ✅ | 82% | 69% | ✅ | 76% | 75% |
4096 | 4096 | 40 | ✅ | 62% | 35% | ✅ | 50% | 40% |
4096 | 40 | 4096 | ✅ | 50% | 11% | ✅ | 36% | 11% |
1024 | 1024 | 80 | ✅ | 54% | 42% | ✅ | 48% | 48% |
1024 | 80 | 1024 | ✅ | 43% | 17% | ✅ | 40% | 14% |
4096 | 320 | 320 | ✅ | 79% | 62% | ✅ | 70% | 68% |
4096 | 1713 | 40 | ✅ | 52% | 32% | ✅ | 40% | 34% |
4096 | 40 | 1713 | ✅ | 46% | 19% | ✅ | 40% | 9.7% |
4096 | 92 | 40 | ✅ | 28% | 7.2% | ✅ | 21% | 6.4% |
4096 | 40 | 92 | ✅ | 27% | 7.5% | ✅ | 19% | 7.3% |
1805 | 320 | 768 | ✅ | 75% | 51% | ✅ | 63% | 56% |
1805 | 1280 | 768 | ✅ | 81% | 64% | ❌ | 67% | 71% |
512 | 512 | 32 | ✅ | 26% | 14% | ✅ | 20% | 14% |
512 | 32 | 512 | ✅ | 8.2% | 8.1% | ❌ | 7.5% | 7.7% |
2048 | 2048 | 32 | ✅ | 61% | 40% | ✅ | 50% | 46% |
2048 | 32 | 2048 | ✅ | 35% | 35% | ✅ | 35% | 32% |
2048 | 2048 | 40 | ✅ | 56% | 32% | ✅ | 46% | 36% |
2048 | 40 | 2048 | ✅ | 40% | 5.8% | ✅ | 37% | 5.9% |
2048 | 2048 | 52 | ✅ | 52% | 32% | ✅ | 39% | 35% |
2048 | 52 | 2048 | ✅ | 49% | 14% | ✅ | 47% | 12% |
Compute utilization (higher is better). A ✅ means MFA is much faster than MPS; an ❌ means MFA is slower or about the same.
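For reference, the thread does not define "utilization" explicitly, but assuming the usual convention it is the achieved GEMM throughput divided by the device's peak throughput at that precision:

$$
\text{utilization} = \frac{2\,M N K}{t_{\text{kernel}} \cdot \text{FLOPS}_{\text{peak}}}
$$

where $t_{\text{kernel}}$ is the measured kernel time and $2MNK$ is the FLOP count of one GEMM.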
I will open source Metal FlashAttention, but it's not in a presentable state just yet. I am opening this thread to discuss anything relevant to integration, such as existing bottlenecks, simulation results, dependencies, etc.
cc: @ggerganov
In general, `ggml` currently lacks an efficient GEMM implementation, and therefore we currently rely on third-party BLAS packages, which is obviously not desirable. Adding GEMM to `ggml` will be the next big addition to the framework - both for CPU and GPU.
I am still in the process of formulating some requirements for what a potential implementation should satisfy. And I fully understand that it will be close to impossible to achieve the maximum performance available from dedicated libraries (such as cuBLAS, for example). We will definitely aim for simplicity of the implementation and quantization support.
Currently, `llama.cpp` does not use MPS matrix multiplication as it is too inefficient. Instead, we fall back to the AMX CBLAS interface, which is also not the best since it lacks quantization support, but it is much better than naive dot-product based matrix multiplication.
We have to take a look at the MFA that you propose, but I think the main "features" of the implementation will be its simplicity and the option to extend for quantized matrices. Third-party dependencies are highly undesired. Performance for `perplexity` computation will be important.
I think ideally, a GEMM implementation should make use of the existing dot-product kernels (both on the CPU and GPU) in order to keep the code size manageable, even if this means sacrificing a bit of performance.
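As a rough illustration of that structure (a hypothetical sketch only - `ggml` is written in C and this is not its API; `dot` and `gemmViaDotProducts` are made-up names), a GEMM can be expressed entirely in terms of an existing dot-product routine:

```swift
// Hypothetical sketch: a GEMM whose inner loop is just a reusable
// dot-product routine, trading peak performance for less kernel code.
func dot(_ a: ArraySlice<Float>, _ b: ArraySlice<Float>) -> Float {
    // Stand-in for an existing, already-optimized dot-product kernel.
    return zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

/// C = A * B, with A stored row-major (M x K) and B stored as its transpose
/// (N x K, row-major), so every output element is a dot product of two
/// contiguous rows.
func gemmViaDotProducts(a: [Float], bT: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        let aRow = a[(i * k)..<((i + 1) * k)]
        for j in 0..<n {
            let bRow = bT[(j * k)..<((j + 1) * k)]
            c[i * n + j] = dot(aRow, bRow)
        }
    }
    return c
}
```

The obvious cost is that this forgoes register- and threadgroup-level blocking, which is exactly the performance trade-off acknowledged above.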
Will be looking into formulating the above into a well-defined issue / project.
The biggest concern is how MFA is compiled. It needs to use the SIMD futures instructions, which have a weird Xcode 14.2 dependency (more info here). The gist of my build system (sketched below):
- Swift script with a colorful command-line interface
- Locates `xcrun metal` from Xcode 14.2 and `xcrun metallib` from a different Xcode
- Packages the source code alongside the Metal binary
- Designed so the typical user will download the `metallib` from GitHub releases, instead of compiling it
*Build script interface*
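As a rough illustration of the compile step the script automates (a hypothetical sketch, not the actual MFA build script - the Xcode paths, file names, and the `run` helper are assumptions):

```swift
import Foundation

// Hypothetical sketch of the two-toolchain compile step. It selects the
// Metal compiler from Xcode 14.2 (required for the SIMD futures / async
// copy instructions) via DEVELOPER_DIR, compiles the shader source to .air,
// then links it into a single .metallib.
func run(_ tool: String, _ args: [String], developerDir: String) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/xcrun")
    process.arguments = [tool] + args
    var env = ProcessInfo.processInfo.environment
    env["DEVELOPER_DIR"] = developerDir   // picks which Xcode's toolchain xcrun uses
    process.environment = env
    try process.run()
    process.waitUntilExit()
}

let xcode142 = "/Applications/Xcode_14.2.app/Contents/Developer"   // assumed path
let otherXcode = "/Applications/Xcode.app/Contents/Developer"      // assumed path

// Compile with the Xcode 14.2 front end...
try run("metal", ["-c", "GEMM.metal", "-o", "GEMM.air"], developerDir: xcode142)
// ...then link with metallib from a different Xcode, as described above.
try run("metallib", ["GEMM.air", "-o", "libMetalFlashAttention.metallib"],
        developerDir: otherXcode)
```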
> Third-party dependencies are highly undesired.
There's a dependency on AppleGPUInfo, which @liuliu is cleaning up (removing the DeviceKit dependency).
> but I think the main "features" of the implementation will be its simplicity
I designed the repo as a "reference implementation" of modern algorithms. It's as simple as possible, so you can fork it or modify it for your own needs. This is an area where MPS struggled. For example, I wanted to take full control over the GPU command encoding process, because MPS has high sequential overhead. I couldn't get Apple to support more flexible encoding in MPS.
> option to extend for quantized matrices.
I think the Q, K, and V during attention are all in FP16. While I do hope to support quantized GEMV, it's not a priority, since you wrote a pretty efficient kernel yourself. The biggest issue is either combining FlashAttention with a K-splitting algorithm or falling back to standard attention for very small sequence lengths.
https://twitter.com/philipturnerar/status/1672104485387214851
I'll discuss what I was thinking about - triangular FlashAttention. Basically, you compile a finite number of Metal compute pipeline states, one for each modulo of the sequence length. You create an implicit triangular mask by skipping half of the computations. I don't know whether reordering the operations into a rectangle will increase performance.
I'll explain the algorithm better as I decide which parts of my old code base to open-source.
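A host-side sketch of the block-skipping idea, assuming a causal mask and block-tiled attention; the names and the diagonal-block handling are illustrative, not MFA's actual kernel structure:

```swift
// Illustrative sketch of "triangular FlashAttention" block skipping, not
// MFA's actual kernel. With the attention matrix tiled into blocks, every
// block strictly above the diagonal is fully masked, so roughly half of
// the work can be skipped by bounding the inner loop instead of masking
// element-wise.
let sequenceLength = 2048
let blockSize = 64
let numBlocks = (sequenceLength + blockSize - 1) / blockSize

func processBlock(query qBlock: Int, keyValue kvBlock: Int, masked: Bool) {
    // Placeholder for one FlashAttention inner iteration over a
    // (blockSize x blockSize) tile of softmax(QK^T) V.
}

for qBlock in 0..<numBlocks {
    // Key/value blocks with index > qBlock lie entirely above the causal
    // diagonal and are never loaded or computed.
    for kvBlock in 0...qBlock {
        // Only the diagonal block needs an explicit triangular mask;
        // blocks below the diagonal are unmasked.
        processBlock(query: qBlock, keyValue: kvBlock, masked: kvBlock == qBlock)
    }
}
```

Whether it pays off to further reorder the skipped work into a dense rectangle is exactly the open question raised above.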
> I'll discuss what I was thinking about - triangular FlashAttention.
TFA, even the abbreviation looks cool
@ggerganov MFA is finally in a state where we can start integration tests. I will eventually post pre-compiled binaries, but for the moment, you'll need to download Xcode 14.2 from the Apple developer archive to compile it. Ideally, you would compile and host your own Metal binaries for llama.cpp users, or create a custom metallib writer that emits the SIMD async copy instructions (similar to Julia's LLVM fork for Metal).
The biggest benefit is speeding up matrix-matrix multiplication and reducing the need to close the `MTLComputeCommandEncoder` whenever something must be executed on MPS. Once MFA supports FlashAttention, you can remove the MPSSoftmax dependency too. However, this would only work on M1 Macs (Intel Macs would need a fallback library, maybe CLBlast).
For Float16, MFA is consistently 2x faster than MPS. MPS performance seems to have regressed a bit on this macOS 14 beta, or maybe it's a terrible driver latency bottleneck with how I encoded the MPS commands. Either way, this indicates a real-world speedup from greater flexibility in how you encode commands.
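To illustrate the encoding flexibility being referred to, here is a minimal Swift sketch (not llama.cpp's or MFA's actual integration code; `GEMMDispatch` and the buffer layout are made up) of batching several custom GEMM dispatches on a single `MTLComputeCommandEncoder`. MPS kernels encode onto the command buffer themselves, so any open encoder has to be ended before each MPS call.

```swift
import Metal

// Minimal sketch: many custom GEMM dispatches share one compute encoder.
struct GEMMDispatch {
    var pipeline: MTLComputePipelineState   // e.g. an MFA GEMM pipeline
    var a, b, c: MTLBuffer                  // illustrative buffer layout
    var threadgroups: MTLSize
    var threadsPerThreadgroup: MTLSize
}

func encode(_ dispatches: [GEMMDispatch], on commandBuffer: MTLCommandBuffer) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else { return }
    for gemm in dispatches {
        encoder.setComputePipelineState(gemm.pipeline)
        encoder.setBuffer(gemm.a, offset: 0, index: 0)
        encoder.setBuffer(gemm.b, offset: 0, index: 1)
        encoder.setBuffer(gemm.c, offset: 0, index: 2)
        encoder.dispatchThreadgroups(gemm.threadgroups,
                                     threadsPerThreadgroup: gemm.threadsPerThreadgroup)
    }
    // One endEncoding() for the whole batch; with MPS in the loop, the
    // encoder would have to be closed and reopened around every MPS call.
    encoder.endEncoding()
}
```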