
Performance Regression on Apple Silicon M1: GPU → CPU Fallback in v0.12.9 (works correctly in v0.12.5)

Open rwellinger opened this issue 1 month ago • 27 comments

What is the issue?

After upgrading from Ollama v0.12.5 to v0.12.9, inference performance degraded dramatically on Apple Silicon M1. The system now uses 50% CPU instead of 100% GPU (Metal), making lyric generation and other LLM tasks unusably slow during production demos.

This worked perfectly in v0.12.5: GPU-only inference with no CPU usage.

Relevant log output

  Expected Behavior (v0.12.5)

  $ ollama ps
  NAME              ID            SIZE      PROCESSOR    UNTIL
  qwen2.5:32b       ...           22 GB     100% GPU     ...
  - Processor: 100% GPU (Metal)
  - Performance: Fast, production-ready
  - CPU Usage: 0%

  Actual Behavior (v0.12.9)

  $ ollama ps
  NAME              ID            SIZE      PROCESSOR    UNTIL
  qwen2.5:32b       ...           22 GB     50% CPU      ...
  - Processor: 50% CPU (Metal acceleration appears broken)
  - Performance: Extremely slow, unusable
  - GPU Usage: Not utilized despite OLLAMA_DEVICE=metal

  Configuration

  LaunchAgent plist (~/Library/LaunchAgents/com.ollama.server.plist), EnvironmentVariables dict:
  <dict>
      <key>HOME</key>
      <string>/Users/robertw</string>
      <key>OLLAMA_HOST</key>
      <string>0.0.0.0:11434</string>
      <key>OLLAMA_ORIGINS</key>
      <string>*</string>
      <key>OLLAMA_KEEP_ALIVE</key>
      <string>3600</string>
      <key>OLLAMA_MAX_MEMORY</key>
      <string>25GiB</string>
      <key>OLLAMA_DEVICE</key>
      <string>metal</string>
      <key>PATH</key>
      <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
  </dict>
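
  To apply changes after editing this plist, the agent is reloaded with the usual launchctl commands:

  launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist
  launchctl load ~/Library/LaunchAgents/com.ollama.server.plist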

  Reproduction Steps

  1. Install Ollama v0.12.9 on Apple Silicon M4 MacBook Air (32 GB)
  2. Configure environment variables as shown above
  3. Start Ollama via LaunchAgent
  4. Load large model: ollama run qwen2.5:32b
  5. Run ollama ps → Shows 50% CPU instead of 100% GPU
  6. Downgrade to v0.12.5
  7. Restart Ollama
  8. Run ollama ps → Shows 100% GPU (correct behavior)
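
  To verify where the layers actually land, the server log can also be checked for offload messages (the log path assumes a standard macOS install):

  # default server log location on macOS
  grep -iE "offload|metal|layers" ~/.ollama/logs/server.log | tail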

  Impact

  - Production blocker: Lyric generation took minutes instead of seconds during a customer demo
  - Regression: v0.12.5 works perfectly with GPU-only inference
  - Workaround: Forced to stay on v0.12.5

  Related Issues

  - #11888 (M2 Pro CPU instead of Metal GPU)
  - #8623 (ollama ps shows 100% GPU but uses CPU)
  - #10445 (VRAM full but CPU/RAM actually used)

  Additional Context

  The release notes for v0.12.9 mention "Fix performance regression on CPU-only systems". It appears this fix may have introduced a regression for Apple Silicon unified memory systems by incorrectly triggering CPU fallback logic.

  Question

  Is this a known issue? Should I test with OLLAMA_LLM_LIBRARY=metal as a workaround, or is this a fundamental regression in the Metal backend for v0.12.9?
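
  A minimal foreground test for that workaround might look like this (untested here, and I'm not certain OLLAMA_LLM_LIBRARY is still honored in 0.12.x):

  OLLAMA_LLM_LIBRARY=metal ollama serve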

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.12.9

rwellinger avatar Nov 05 '25 21:11 rwellinger

Server log may help in debugging.

rick-github avatar Nov 05 '25 21:11 rick-github

Please give 0.12.10 a try, as a metal related scheduling bug was recently fixed that may be the cause of what you're seeing.

dhiltgen avatar Nov 06 '25 17:11 dhiltgen

I confirm that it's not working on Apple M1, even with version 0.12.10. I'm using it right now, same model, same inputs, and it's super slow in the latest version, unfortunately.

ComplexPlaneDev avatar Nov 07 '25 13:11 ComplexPlaneDev

Server log may help in debugging.

rick-github avatar Nov 07 '25 13:11 rick-github

server.log

My server.log, in case it can help.

ComplexPlaneDev avatar Nov 07 '25 13:11 ComplexPlaneDev

time=2025-11-07T14:28:07.694+01:00 level=INFO source=server.go:653 msg="loading model" "model layers"=49 requested=1

@ComplexPlaneDev Have you set num_gpu for this model?

rick-github avatar Nov 07 '25 15:11 rick-github

@rick-github there has never been any need for this before. Has anything changed? The only thing that changed on my side was the Ollama upgrade. The model itself has never changed; it has always run at a nice speed. All of a sudden it started running super slow.

FYI, the model is this one: https://ollama.com/jobautomation/OpenEuroLLM-Italian

ComplexPlaneDev avatar Nov 07 '25 17:11 ComplexPlaneDev

The parameters for this model explicitly limit the GPU layer count to 1, which accounts for the slowness. This should have been the case in previous ollama versions as well. Since I don't have a Mac, I can't speak to the perceived slowdown after the update. You can make this model run fully on the GPU by creating a copy without the restricting parameters:

% ollama show --modelfile jobautomation/OpenEuroLLM-Italian:latest | egrep -v "num_ctx|num_gpu" > Modelfile
% ollama create OpenEuroLLM-Italian

Testing:

% ollama run --verbose jobautomation/OpenEuroLLM-Italian:latest hello 2>&1 | grep "^eval rate"
eval rate:            7.65 tokens/s
% ollama run --verbose OpenEuroLLM-Italian:latest hello 2>&1 | grep "^eval rate"
eval rate:            53.00 tokens/s
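
Alternatively, num_gpu can be overridden per request through the API without creating a copy; a sketch (99 effectively means "offload all layers"):

% curl http://localhost:11434/api/generate -d '{"model": "jobautomation/OpenEuroLLM-Italian:latest", "prompt": "hello", "options": {"num_gpu": 99}}'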

rick-github avatar Nov 07 '25 17:11 rick-github

Could it be that ollama previously wasn't taking those parameters into account, at least in this context?

ComplexPlaneDev avatar Nov 07 '25 17:11 ComplexPlaneDev

The code that processes the parameters is pretty device-independent, so I think it's unlikely, but I don't have access to a Mac so I can't test.

rick-github avatar Nov 07 '25 17:11 rick-github

It could be that the client was overriding the value for num_gpu and that has changed. What client are you using?

rick-github avatar Nov 07 '25 17:11 rick-github

I'm calling the model using LangChain JS, but the issue occurs even with the plain Ollama UI. I'll test the parameter you mentioned. Thanks!

ComplexPlaneDev avatar Nov 07 '25 21:11 ComplexPlaneDev

As an additional test, I went back to 0.12.5, as @rwellinger reported, and now it's fast again!

Even if I run on the command line

% ollama show --modelfile jobautomation/OpenEuroLLM-Italian:latest | egrep "num_ctx|num_gpu"
PARAMETER num_ctx 2048
PARAMETER num_gpu 1

the restricting parameters are still there. So nothing has changed apart from the Ollama version (0.12.10 => 0.12.5), and the same test in the Ollama UI is now super responsive. So there's definitely a regression somewhere; I don't know where, though.

Maybe if I find a way to build it locally, I could git bisect this.
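
Something along these lines should work (a sketch, assuming a Go toolchain; each bisect step needs a manual speed check):

% git clone https://github.com/ollama/ollama.git && cd ollama
% git bisect start
% git bisect bad v0.12.10
% git bisect good v0.12.5
# at each step: build, run, judge the speed, then mark it
% go run . serve &
% go run . run jobautomation/OpenEuroLLM-Italian:latest
% git bisect good   # or: git bisect bad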

ComplexPlaneDev avatar Nov 07 '25 21:11 ComplexPlaneDev

Post the server log.

rick-github avatar Nov 07 '25 21:11 rick-github

server.log

That's the server.log of the working version. Thanks!

ComplexPlaneDev avatar Nov 07 '25 21:11 ComplexPlaneDev

0.12.8 also works fine.

Given that these are the changes between the two versions, maybe this comparison helps to spot it: https://github.com/ollama/ollama/compare/v0.12.8...v0.12.9

ComplexPlaneDev avatar Nov 07 '25 23:11 ComplexPlaneDev

Ok, sorry, I was misled by the issue title.

I confirm that v0.12.9 still works; it's v0.12.10 that is broken. So it must be one of these changes:

https://github.com/ollama/ollama/compare/v0.12.9...v0.12.10

ComplexPlaneDev avatar Nov 07 '25 23:11 ComplexPlaneDev

Could it be https://github.com/ollama/ollama/commit/6aa72830763cf694da998f5305de89701c75cea0 ?? @dhiltgen

ComplexPlaneDev avatar Nov 07 '25 23:11 ComplexPlaneDev

From the logs, Flash Attention is enabled in 0.12.10 but not 0.12.5. Try setting OLLAMA_FLASH_ATTENTION=0 in the server environment.
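
On macOS that can be set for the launchd-managed server with something like the following (restart the server afterwards):

% launchctl setenv OLLAMA_FLASH_ATTENTION 0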

rick-github avatar Nov 07 '25 23:11 rick-github

The env var does not make any difference

ComplexPlaneDev avatar Nov 08 '25 20:11 ComplexPlaneDev

Log?

rick-github avatar Nov 08 '25 20:11 rick-github

server.log

ComplexPlaneDev avatar Nov 08 '25 20:11 ComplexPlaneDev

Yes, I confirm, it's https://github.com/ollama/ollama/commit/6aa72830763cf694da998f5305de89701c75cea0 !

git clone https://github.com/ollama/ollama.git
git checkout v0.12.10
git revert 6aa72830763cf694da998f5305de89701c75cea0
go run . serve
go run . run jobautomation/OpenEuroLLM-Italian:latest

and it works nicely, super fast.

ComplexPlaneDev avatar Nov 08 '25 21:11 ComplexPlaneDev

No, wait, hold on. I've cleared everything, did a checkout of v0.12.10, no revert of anything. I'm in this state:

commit 80d34260ea16e76c9ef0d014a86cc130421855f1 (HEAD, tag: v0.12.10-rc1, tag: v0.12.10)
Author: Daniel Hiltgen <[email protected]>
Date:   Wed Nov 5 12:33:01 2025 -0800

    ci: re-enable signing (#12974)

commit 1ca608bcd155c771d0fed683a75d8367fe9c7144 (tag: v0.12.10-rc0)
Author: nicole pardal <[email protected]>
Date:   Wed Nov 5 11:58:03 2025 -0800

    embeddings: added embedding command for cl (#12795)

    Co-authored-by: A-Akhil <[email protected]>

    This PR introduces a new ollama embed command that allows users to generate embeddings directly from the command line.

    - Added ollama embed MODEL [TEXT...] command for generating text embeddings
    - Supports both direct text arguments and stdin piping for scripted workflows
    - Outputs embeddings as JSON arrays (one per line)

Did a go run . serve, then a go run . run mymodel, and it's fast.

Now I'm really puzzled. Was the wrong binary packaged? I have no explanation.
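
One way to check which binary actually runs would be (a sketch; the CLI path assumes the default install location):

% which ollama
% /usr/local/bin/ollama -v
% go build -o ollama-src . && ./ollama-src -v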

ComplexPlaneDev avatar Nov 08 '25 21:11 ComplexPlaneDev

Thanks for the analysis. I'm not that deep into ollama myself; just very thankful that you guys make it happen.

rwellinger avatar Nov 09 '25 09:11 rwellinger

server.log

I'm uploading the server.log from the binary I built at the v0.12.10 commit hash, and it's working fine.

It looks like the only way to reproduce this is to install the official 0.12.10 release package, so I have no idea how to reproduce it from source. Hopefully the log can shed some light here.

ComplexPlaneDev avatar Nov 09 '25 11:11 ComplexPlaneDev

@ComplexPlaneDev can you set OLLAMA_DEBUG=2 and share an updated server log with 0.12.10 or 0.12.11?
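
If it helps, a quick way to capture that is running the server in the foreground (a sketch; assumes no other instance is already listening on the port):

% OLLAMA_DEBUG=2 ollama serve 2> server.log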

dhiltgen avatar Nov 14 '25 00:11 dhiltgen