ggml : add ANE backend
According to this https://github.com/ggerganov/llama.cpp/discussions/336#discussioncomment-11184134, there is a new CoreML API, and an ANE backend might be possible to implement with the latest Apple software/hardware.
It seems that it is only available for matrix multiplications and not for custom operations. Apologies for my ignorance.
Would it be of any help that LM Studio has implemented MLX? And here is Anemll, an ANE library to work with MLX; it is MIT licensed. And there's FastMLX with an Apache 2.0 license.
I'm very interested in this direction and would like to share some findings from my experiments:
I created a demo implementation using MLTensor for matrix multiplication and compared it with a Metal implementation.
Example Code:
func mulMat(a: [Float], aRows: Int, aCols: Int,
            b: [Float], bRows: Int, bCols: Int,
            c: inout [Float]) throws -> Double {
    print("CoreML Device: Performing matrix multiplication - Matrix dimensions [\(aRows)x\(aCols)] * [\(bRows)x\(bCols)] = [\(aRows)x\(bCols)]")

    // Validate input parameters
    if a.count < aRows * aCols || b.count < bRows * bCols || c.count < aRows * bCols {
        print("CoreML Device: Input matrix dimensions mismatch")
        throw DeviceError.invalidParameters("Input matrix dimensions mismatch")
    }

    // Validate dimension compatibility
    if aCols != bRows {
        print("CoreML Device: Matrix dimensions incompatible, A's columns (\(aCols)) must equal B's rows (\(bRows))")
        throw DeviceError.invalidParameters("Matrix dimensions incompatible")
    }

    if #available(macOS 15.0, *) {
        let startTime = CFAbsoluteTimeGetCurrent()

        let aTensor = MLTensor(shape: [aRows, aCols], scalars: a)
        let bTensor = MLTensor(shape: [bRows, bCols], scalars: b)
        let result = aTensor.matmul(bTensor)

        // Use semaphore to wait synchronously for the async operation
        let semaphore = DispatchSemaphore(value: 0)
        var resultArray: [Float] = []
        Task {
            let shapedArray = await result.shapedArray(of: Float.self)
            resultArray = Array(shapedArray.scalars)
            semaphore.signal()
        }

        // Wait for the async operation to complete
        semaphore.wait()

        // Copy results to output parameter c
        for i in 0..<min(c.count, resultArray.count) {
            c[i] = resultArray[i]
        }

        let endTime = CFAbsoluteTimeGetCurrent()
        let duration = (endTime - startTime) * 1000 // Convert to milliseconds
        print("CoreML Device: Execution time - \(String(format: "%.2f", duration)) ms")
        return duration
    } else {
        throw DeviceError.executionFailed("System version lower than macOS 15.0, MLTensor unavailable")
    }
}
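For reference, a minimal call site might look like the following sketch. DeviceError is assumed to be a small error enum defined elsewhere in the project; the declaration here is just an illustration.

// Roughly how DeviceError is assumed to be declared (illustrative only).
enum DeviceError: Error {
    case invalidParameters(String)
    case executionFailed(String)
}

// 2x3 * 3x2 = 2x2
let a: [Float] = [1, 2, 3,
                  4, 5, 6]
let b: [Float] = [ 7,  8,
                   9, 10,
                  11, 12]
var c = [Float](repeating: 0, count: 2 * 2)

do {
    let ms = try mulMat(a: a, aRows: 2, aCols: 3,
                        b: b, bRows: 3, bCols: 2,
                        c: &c)
    print("Result: \(c), took \(String(format: "%.2f", ms)) ms")
} catch {
    print("Matmul failed: \(error)")
}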
Here's what I discovered:
- Performance comparison between MLTensor and Metal showed no significant differences for matrix operations.
- I used Apple's Instruments app to monitor hardware utilization during execution, and interestingly, MLTensor was actually utilizing the GPU rather than ANE for computations.
- After thoroughly searching through Apple's official documentation (which is quite limited on this topic), I found the withMLTensorComputePolicy API. However, this API only supports two options: "CPU only" and "CPU and GPU".
Based on these observations, I suspect that the current CoreML framework still does not support ANE at the operator level. The API appears to lack explicit control for directing specific operations to the Neural Engine.
If my understanding is incorrect, I would appreciate any clarification.
Forgive me if this is useless, as I am not an expert on this topic, but this comment and this link suggest otherwise.
@optlink Thank you for sharing the information. I've investigated https://github.com/ggml-org/llama.cpp/discussions/336#discussioncomment-11184134 and found that the computation still occurs on the GPU even after setting MLComputeUnits.cpuAndNeuralEngine.
I tested with the following code:
resultArray = await withMLTensorComputePolicy(
    MLComputePolicy(.cpuAndNeuralEngine)
) {
    let aTensor = MLTensor(shape: [aRows, aCols], scalars: a)
    let bTensor = MLTensor(shape: [bRows, bCols], scalars: b)
    let result = aTensor.matmul(bTensor)
    let shapedArray = await result.shapedArray(of: Float.self)
    return Array(shapedArray.scalars)
}
However, I observed that the operations still executed on the GPU. Even after explicitly setting MLComputeUnits.cpuOnly, the computation didn't shift to the CPU as expected.
The documentation for withMLTensorComputePolicy is quite limited, so I'm not sure if my usage is incorrect or if there's an underlying issue with the API.
@BB-fat Did you try different shapes? Maybe the NPU supports only specific shapes (e.g. multiples of 16/32, etc.).
Hi @ggerganov , I've tried shapes that are multiples of 16/32, but still can't use the ANE.
I found a reply from an Apple engineer on the Apple Developer Forum, which included a sample code snippet containing withMLTensorComputePolicy. It appears to be consistent with my usage.
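A sketch of such a sweep, reusing the mulMat helper from the earlier snippet (the sizes below are just examples, not an exhaustive list), with Instruments open to see which compute unit lights up:

// Square matrices with ANE-friendly dimensions (multiples of 16/32/64).
let sizes = [64, 128, 256, 512, 1024]
do {
    for n in sizes {
        let a = [Float](repeating: 1.0, count: n * n)
        let b = [Float](repeating: 2.0, count: n * n)
        var c = [Float](repeating: 0.0, count: n * n)
        let ms = try mulMat(a: a, aRows: n, aCols: n,
                            b: b, bRows: n, bCols: n,
                            c: &c)
        print("\(n)x\(n): \(String(format: "%.2f", ms)) ms")
    }
} catch {
    print("Sweep failed: \(error)")
}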
@BB-fat I've created a post on Apple Developer Forums to try and get some help from Apple engineers regarding this: https://developer.apple.com/forums/thread/775589
My understanding from reading the API docs and looking at https://i.blackhat.com/asia-21/Friday-Handouts/as21-Wu-Apple-Neural_Engine.pdf (from 2021, but probably still useful for an architecture overview) is that the ANE is driven by compiled graphs and not by calls from the host. So while it's possible that calling .matmul would create a kernel, compile it, upload it, and schedule it, it's likely not intended for this purpose at all. But the .mlmodel file linked here https://github.com/ggml-org/llama.cpp/discussions/336#discussioncomment-6149786 does map to the ANE.
You can also see that the .mlmodel format is seemingly intended not for single kernels but for whole graphs. This makes sense from a hardware-architecture point of view: this way you need less involvement and synchronization from the host CPU.
Also found this tool, somewhat outdated, but you can see the model-loading flow for the ANE too, which involves compilation steps: https://github.com/fredyshox/ANECompat/blob/master/src/ANECompat.m
I would expect that integrating this into llama.cpp would involve writing a "lowered" version of every supported model architecture that puts as much as possible into one MLModel object, which is then compiled at runtime and scheduled for operations like a forward pass. I think this is what https://github.com/Anemll/Anemll/ is doing.
Edit: found more resources about direct ANE access:
https://github.com/geohot/tinygrad/blob/master/extra/accel/ane/README.md
https://github.com/freedomtan/coreml_to_ane_hwx
All of this circumvents the official public APIs in some way; the public API is to compile mlmodel graphs (which are just protobufs) via a call to CoreML.framework.
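For anyone who wants to poke at that official path, an untested sketch of compiling and loading a model with the Neural Engine allowed looks roughly like this. "matmul.mlpackage" and "input" are placeholder names, not taken from any of the linked projects, and the input shape is arbitrary.

import CoreML

// Assumes this runs inside a throwing function.
// Compile the .mlpackage/.mlmodel protobuf graph to an .mlmodelc at runtime...
let modelURL = URL(fileURLWithPath: "matmul.mlpackage")   // placeholder path
let compiledURL = try MLModel.compileModel(at: modelURL)

// ...then load it with a configuration that allows the Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine                 // or .all

let model = try MLModel(contentsOf: compiledURL, configuration: config)

// Input/output names and shapes depend on how the graph was authored.
let inputArray = try MLMultiArray(shape: [1, 256], dataType: .float32)
let features = try MLDictionaryFeatureProvider(dictionary: ["input": inputArray])
let output = try model.prediction(from: features)
print(output.featureNames)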
Check out the ANEMLL repo: https://github.com/Anemll/Anemll
A possibility to consider is that the OS is preferentially scheduling on the GPU because on the tested hardware, the GPU is faster. It might behave differently on a base model M* (assuming @BB-fat was using a Pro, Max, or Ultra) or on an iDevice SoC. This might be something that can be influenced by running in reduced power mode, since ANE seems considerably more efficient than GPU with the same workload.
Regarding the earlier MLX suggestion: Anemll doesn't use MLX; it converts the model to CoreML, which is how it addresses this.
Another way to target the ANE is to use MPSGraph.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraphoptimization/level1 MPSGraph optimization level 1 triggers the placement pass, which enables operators to execute on the CPU or NPU.
You might also want to enable https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraphcompilationdescriptor/reducedprecisionfastmath.
The advantage of going that route is that you can build the graph at runtime rather than relying on CoreML. That said, I wonder how brittle it actually is (it's actually what PyTorch uses) and whether it'd make sense for it to coexist with the current Metal backend...
However, one limitation to highlight: for quantization, the provided dequantize only supports 4/8-bit elements: https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/dequantize(_:luttensor:axis:name:), which is probably quite a roadblock to relying solely on MPSGraph...
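To make that route concrete, here is an untested sketch of a single matmul built and compiled through MPSGraph with the level-1 placement pass enabled. Shapes and values are arbitrary, and where the ops actually land is still up to the framework.

import Metal
import MetalPerformanceShadersGraph

// Build a tiny graph: C = A x B in FP16.
let device = MTLCreateSystemDefaultDevice()!
let graphDevice = MPSGraphDevice(mtlDevice: device)
let graph = MPSGraph()

let a = graph.placeholder(shape: [256, 256], dataType: .float16, name: "A")
let b = graph.placeholder(shape: [256, 256], dataType: .float16, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

// Level 1 enables the placement pass mentioned above; the reduced-precision
// fast-math option linked above also lives on this descriptor.
let compileDesc = MPSGraphCompilationDescriptor()
compileDesc.optimizationLevel = .level1

let shapedType = MPSGraphShapedType(shape: [256, 256], dataType: .float16)
let executable = graph.compile(with: graphDevice,
                               feeds: [a: shapedType, b: shapedType],
                               targetTensors: [c],
                               targetOperations: nil,
                               compilationDescriptor: compileDesc)

// Feed data and run synchronously on a command queue.
let queue = device.makeCommandQueue()!
func tensorData(_ scalars: [Float16]) -> MPSGraphTensorData {
    let bytes = scalars.withUnsafeBufferPointer { Data(buffer: $0) }
    return MPSGraphTensorData(device: graphDevice, data: bytes,
                              shape: [256, 256], dataType: .float16)
}
let inputsByTensor = [a: tensorData([Float16](repeating: 1, count: 256 * 256)),
                      b: tensorData([Float16](repeating: 2, count: 256 * 256))]
// Input order must match the order the executable expects.
let orderedInputs = (executable.feedTensors ?? [a, b]).map { inputsByTensor[$0]! }
let results = executable.run(with: queue, inputs: orderedInputs,
                             results: nil, executionDescriptor: nil)
print("Got \(results.count) result tensor(s)")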
...it's very brittle, with a high density of performance cliffs