
ggml : add ANE backend

Open ggerganov opened this issue 1 year ago • 9 comments

According to https://github.com/ggerganov/llama.cpp/discussions/336#discussioncomment-11184134, there is a new CoreML API, and an ANE backend might be possible to implement with the latest Apple software/hardware.

ggerganov avatar Nov 22 '24 08:11 ggerganov

It seems that it is only available for matrix multiplications and not for custom operations. Apologies for my ignorance.

FSSRepo avatar Nov 22 '24 22:11 FSSRepo

Would it be of any help that LM Studio has implemented MLX? Also, here is Anemll, an ANE library that works with MLX; it is MIT licensed. And there's FastMLX with an Apache 2.0 license.

qdrddr avatar Feb 20 '25 14:02 qdrddr

I'm very interested in this direction and would like to share some findings from my experiments:

I created a demo implementation using MLTensor for matrix multiplication and compared it with a Metal implementation.

Example Code:

    import CoreML
    import Foundation

    // Assumed error type: the original snippet references DeviceError
    // without defining it, so a minimal definition is sketched here.
    enum DeviceError: Error {
        case invalidParameters(String)
        case executionFailed(String)
    }

    func mulMat(a: [Float], aRows: Int, aCols: Int,
                b: [Float], bRows: Int, bCols: Int,
                c: inout [Float]) throws -> Double {

        print("CoreML Device: Performing matrix multiplication - Matrix dimensions [\(aRows)x\(aCols)] * [\(bRows)x\(bCols)] = [\(aRows)x\(bCols)]")

        // Validate input buffer sizes
        if a.count < aRows * aCols || b.count < bRows * bCols || c.count < aRows * bCols {
            print("CoreML Device: Input matrix dimensions mismatch")
            throw DeviceError.invalidParameters("Input matrix dimensions mismatch")
        }

        // Validate dimension compatibility
        if aCols != bRows {
            print("CoreML Device: Matrix dimensions incompatible, A's columns (\(aCols)) must equal B's rows (\(bRows))")
            throw DeviceError.invalidParameters("Matrix dimensions incompatible")
        }

        if #available(macOS 15.0, *) {
            let startTime = CFAbsoluteTimeGetCurrent()

            let aTensor = MLTensor(shape: [aRows, aCols], scalars: a)
            let bTensor = MLTensor(shape: [bRows, bCols], scalars: b)
            let result = aTensor.matmul(bTensor)

            // shapedArray(of:) is async; since this function is synchronous,
            // block on a semaphore until the result is materialized.
            let semaphore = DispatchSemaphore(value: 0)
            var resultArray: [Float] = []

            Task {
                let shapedArray = await result.shapedArray(of: Float.self)
                resultArray = Array(shapedArray.scalars)
                semaphore.signal()
            }

            semaphore.wait()

            // Copy the result into the output parameter c
            for i in 0..<min(c.count, resultArray.count) {
                c[i] = resultArray[i]
            }

            let endTime = CFAbsoluteTimeGetCurrent()
            let duration = (endTime - startTime) * 1000 // convert to milliseconds
            print("CoreML Device: Execution time - \(String(format: "%.2f", duration)) ms")

            return duration
        } else {
            throw DeviceError.executionFailed("System version lower than macOS 15.0, MLTensor unavailable")
        }
    }
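
A minimal call site for this helper might look like the following (a sketch; the 2x3 * 3x2 sizes are illustrative, not the sizes used for benchmarking):

    // Hypothetical driver: multiply a 2x3 matrix by a 3x2 matrix (row-major).
    let a: [Float] = [1, 2, 3,
                      4, 5, 6]
    let b: [Float] = [7, 8,
                      9, 10,
                      11, 12]
    var c = [Float](repeating: 0, count: 2 * 2)
    let ms = try mulMat(a: a, aRows: 2, aCols: 3,
                        b: b, bRows: 3, bCols: 2,
                        c: &c)
    print(c) // expected: [58.0, 64.0, 139.0, 154.0]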

Here's what I discovered:

  1. Performance comparison between MLTensor and Metal showed no significant differences for matrix operations.

  2. I used Apple's Instruments app to monitor hardware utilization during execution, and interestingly, MLTensor was actually utilizing the GPU rather than ANE for computations.

[Image: Instruments trace showing the workload executing on the GPU rather than the ANE]
  3. After thoroughly searching through Apple's official documentation (which is quite limited on this topic), I found the withMLTensorComputePolicy API. However, this API only supports two options: "CPU only" and "CPU and GPU".

Based on these observations, I suspect that the current CoreML framework still does not support ANE at the operator level. The API appears to lack explicit control for directing specific operations to the Neural Engine.

If my understanding is incorrect, I would appreciate any clarification.

BB-fat avatar Feb 27 '25 08:02 BB-fat

> 3. After thoroughly searching through Apple's official documentation (which is quite limited on this topic), I found the withMLTensorComputePolicy API. However, this API only supports two options: "CPU only" and "CPU and GPU".
>
> Based on these observations, I suspect that the current CoreML framework still does not support ANE at the operator level. The API appears to lack explicit control for directing specific operations to the Neural Engine.
>
> If my understanding is incorrect, I would appreciate any clarification.

Forgive me if this is useless, as I am not an expert on this topic, but this comment and this link show otherwise.

optlink avatar Feb 27 '25 17:02 optlink

@optlink Thank you for sharing the information. I've investigated https://github.com/ggml-org/llama.cpp/discussions/336#discussioncomment-11184134 and found that the computation still occurs on the GPU even after setting MLComputeUnits.cpuAndNeuralEngine.

I tested with the following code:

resultArray = await withMLTensorComputePolicy(
    MLComputePolicy(.cpuAndNeuralEngine)
) {
    let aTensor = MLTensor(shape: [aRows, aCols], scalars: a)
    let bTensor = MLTensor(shape: [bRows, bCols], scalars: b)
    let result = aTensor.matmul(bTensor)
    let shapedArray = await result.shapedArray(of: Float.self)
    return Array(shapedArray.scalars)
}

However, I observed that the operations still executed on the GPU. Even after explicitly setting MLComputeUnits.cpuOnly, the computation didn't shift to the CPU as expected.

The documentation for withMLTensorComputePolicy is quite limited, so I'm not sure if my usage is incorrect or if there's an underlying issue with the API.

BB-fat avatar Feb 28 '25 03:02 BB-fat

@BB-fat Did you try different shapes? Maybe the NPU supports only specific shapes (e.g. multiples of 16 or 32).

ggerganov avatar Feb 28 '25 06:02 ggerganov

Hi @ggerganov, I've tried shapes that are multiples of 16/32, but the work still doesn't land on the ANE.
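
A sketch of such a sweep, reusing the mulMat helper from earlier (the exact sizes tried are an assumption):

    // Hypothetical shape sweep over ANE-friendly sizes; watch the Neural
    // Engine track in Instruments while this runs.
    for n in [16, 32, 64, 128, 256, 512, 1024] {
        let a = [Float](repeating: 1.0, count: n * n)
        let b = [Float](repeating: 1.0, count: n * n)
        var c = [Float](repeating: 0.0, count: n * n)
        let ms = try mulMat(a: a, aRows: n, aCols: n,
                            b: b, bRows: n, bCols: n,
                            c: &c)
        print("\(n)x\(n): \(String(format: "%.2f", ms)) ms")
    }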

BB-fat avatar Feb 28 '25 07:02 BB-fat

I found a reply from an Apple engineer on the Apple Developer Forum, which included a sample code snippet containing withMLTensorComputePolicy. It appears to be consistent with my usage.

[Image: screenshot of the Apple engineer's forum reply, including a withMLTensorComputePolicy code sample]

BB-fat avatar Feb 28 '25 08:02 BB-fat

@BB-fat I've created a post on Apple Developer Forums to try and get some help from Apple engineers regarding this: https://developer.apple.com/forums/thread/775589

giladgd avatar Mar 01 '25 01:03 giladgd

My understanding from reading the API docs and looking at https://i.blackhat.com/asia-21/Friday-Handouts/as21-Wu-Apple-Neural_Engine.pdf (from 2021, but probably still useful as an architecture overview) is that the ANE is driven by compiled graphs, not by individual calls from the host. So while it's possible that calling .matmul would create a kernel, compile it, upload it, and schedule it, the hardware is likely not intended for that usage at all. But the .mlmodel file linked in https://github.com/ggml-org/llama.cpp/discussions/336#discussioncomment-6149786 does map to the ANE.

You can also see that the .mlmodel format is seemingly intended not for single kernels but for whole graphs. This makes sense from a hardware architecture point of view: this way you need less involvement and synchronization from the host CPU.

I also found this tool; it's somewhat outdated, but you can see the model loading flow for the ANE, which involves compilation steps: https://github.com/fredyshox/ANECompat/blob/master/src/ANECompat.m

I would expect that integrating this into llama.cpp would involve writing a "lowered" version of every supported model architecture that puts as much as possible into one MLModel object, which is then compiled at runtime and scheduled for operations such as a forward pass. I think this is what https://github.com/Anemll/Anemll/ is doing.

Edit: found more resources about direct ANE access:

https://github.com/geohot/tinygrad/blob/master/extra/accel/ane/README.md

https://github.com/freedomtan/coreml_to_ane_hwx

All of this circumvents official public APIs in some way; the public API is to compile mlmodel graphs (which are just protobufs) via the CoreML framework.
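
For concreteness, the public-API flow looks roughly like this (a sketch; "model.mlmodel" is a placeholder path, not a file from this thread):

    import CoreML

    // Compile a .mlmodel (a protobuf graph) into an .mlmodelc bundle, then
    // load it with compute units that allow, but do not force, ANE placement.
    let sourceURL = URL(fileURLWithPath: "model.mlmodel")
    let compiledURL = try MLModel.compileModel(at: sourceURL)

    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    let model = try MLModel(contentsOf: compiledURL, configuration: config)
    // By this point CoreML's compiler has already decided, per layer, whether
    // each op runs on the CPU, GPU, or ANE; there is no per-op control.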

vlad-ivanov-name avatar Mar 02 '25 14:03 vlad-ivanov-name

Check out the ANEMLL repo: https://github.com/Anemll/Anemll

Anemll avatar Mar 08 '25 13:03 Anemll

A possibility to consider is that the OS is preferentially scheduling on the GPU because, on the tested hardware, the GPU is faster. It might behave differently on a base-model M-series chip (assuming @BB-fat was using a Pro, Max, or Ultra) or on an iDevice SoC. This might also be something that can be influenced by running in reduced power mode, since the ANE seems considerably more efficient than the GPU on the same workload.

easp avatar Mar 27 '25 15:03 easp

> Would it be of any help that LM Studio has implemented MLX? Also, here is Anemll, an ANE library that works with MLX; it is MIT licensed. And there's FastMLX with an Apache 2.0 license.

Anemll doesn't use MLX; it solves this by converting the model to CoreML.

lgyStoic avatar Jul 09 '25 02:07 lgyStoic

Another way to target the ANE is to use MPSGraph.

MPSGraph optimisation level 1 (https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraphoptimization/level1) triggers the placement pass, which enables operators to execute on the CPU or NPU.

You might also want to enable https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraphcompilationdescriptor/reducedprecisionfastmath.

The advantage of going that route is that you can build the graph at runtime rather than relying on CoreML. That said, I wonder how brittle it actually is (it's actually what PyTorch uses) and whether it'd make sense for it to coexist with the current Metal backend...
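
A minimal sketch of that route (shapes and names are illustrative; level 1 optimisation is what enables the placement pass mentioned above):

    import Metal
    import MetalPerformanceShadersGraph

    // Build a matmul graph at runtime and compile it with optimisation
    // level 1 so the placement pass can assign ops to the CPU or ANE.
    let graph = MPSGraph()
    let shape: [NSNumber] = [64, 64]
    let a = graph.placeholder(shape: shape, dataType: .float32, name: "A")
    let b = graph.placeholder(shape: shape, dataType: .float32, name: "B")
    let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

    let desc = MPSGraphCompilationDescriptor()
    desc.optimizationLevel = .level1

    let device = MPSGraphDevice(mtlDevice: MTLCreateSystemDefaultDevice()!)
    let shapedType = MPSGraphShapedType(shape: shape, dataType: .float32)
    let executable = graph.compile(with: device,
                                   feeds: [a: shapedType, b: shapedType],
                                   targetTensors: [c],
                                   targetOperations: nil,
                                   compilationDescriptor: desc)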

However, one limit to highlight: for quantisation, the provided dequant only supports 4/8-bit elements (https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/dequantize(_:luttensor:axis:name:)), which is probably quite a roadblock to relying solely on MPSGraph...

mediouni-m avatar Oct 27 '25 02:10 mediouni-m

...it's very brittle, with a high density of performance cliffs

mediouni-m avatar Nov 21 '25 21:11 mediouni-m