
Eval bug: On macOS 12 / 13, Metal crashes after commit 0f0a3c28

Open mchiang0610 opened this issue 5 months ago • 3 comments

Name and Version

When running tests against the latest llama.cpp, I noticed crashes on both macOS 12 and 13 (Ventura).

Either the old macOS version, limited VRAM, or both trigger the problem. To pinpoint the exact location of the failure and resulting crash, I added the following diff:

diff --git a/ggml/src/ggml-metal/ggml-metal-context.m b/ggml/src/ggml-metal/ggml-metal-context.m
index af9ff2143..e327fc152 100644
--- a/ggml/src/ggml-metal/ggml-metal-context.m
+++ b/ggml/src/ggml-metal/ggml-metal-context.m
@@ -294,10 +294,12 @@ void ggml_metal_set_tensor_async(ggml_metal_t ctx, struct ggml_tensor * tensor,
 
 void ggml_metal_get_tensor_async(ggml_metal_t ctx, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
     @autoreleasepool {
+        GGML_LOG_INFO("%s XXX calling newBufferWithBytesNoCopy data:%p size:%zu\n", __func__, data, size);
         id<MTLBuffer> buf_dst = [ctx->device newBufferWithBytesNoCopy:data
                                                                length:size
                                                               options:MTLResourceStorageModeShared
                                                           deallocator:nil];
+        GGML_ASSERT(buf_dst != nil && "newBufferWithBytesNoCopy failed");
 
         struct ggml_metal_buffer_id bid_src = ggml_metal_get_buffer_id(tensor);
         if (bid_src.metal == nil) {

To build a version compatible with older macOS:

export SDKROOT=/Applications/Xcode_14.1.0.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
export DEVELOPER_DIR=/Applications/Xcode_14.1.0.app/Contents/Developer

then build with

cmake -B build -DCMAKE_OSX_DEPLOYMENT_TARGET=12.0
cmake --build build --parallel 8

Copy the binaries to a macOS 12 or 13 system with 16 GB (or 8 GB) of RAM:

./llama-cli -m <path to llama3.2 or qwen3>
...
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_metal_get_tensor_async XXX calling newBufferWithBytesNoCopy data:0x10a910000 size:607744
/Users/ollama/code/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:302: GGML_ASSERT(buf_dst != nil && "newBufferWithBytesNoCopy failed") failed

If you pass --gpu-layers with a value one less than the full layer count, the ggml_metal_get_tensor_async code path is never taken, there is no crash, and the model works properly.

Operating systems

Mac

GGML backends

Metal

Hardware

tested on Apple M1 Mac mini

Models

tested on llama3.2, qwen3

Problem description & steps to reproduce

./llama-cli -m <path to llama3.2 or qwen3>
...
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_metal_get_tensor_async XXX calling newBufferWithBytesNoCopy data:0x10a910000 size:607744
/Users/ollama/code/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:302: GGML_ASSERT(buf_dst != nil && "newBufferWithBytesNoCopy failed") failed

First Bad Commit

https://github.com/ggml-org/llama.cpp/commit/0f0a3c2851134d49955f3c85afbb0b1bb47c3e07

Relevant log output

/Users/ollama/code/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:302: GGML_ASSERT(buf_dst != nil && "newBufferWithBytesNoCopy failed") failed

mchiang0610 avatar Sep 26 '25 02:09 mchiang0610

I can confirm this error, running on macOS 12.5.1. It happens with both llama-server and llama-cli. M1 Max, 32 GB RAM, with plenty of free RAM.

swittk avatar Oct 18 '25 03:10 swittk

I can confirm this as well.

espoirMur avatar Oct 21 '25 00:10 espoirMur

Any update? @ggerganov

secret-ai-dev avatar Nov 15 '25 06:11 secret-ai-dev