Performance Regression on Apple Silicon M1: GPU → CPU Fallback in v0.12.9 (works correctly in v0.12.5)
What is the issue?
After upgrading from Ollama v0.12.5 to v0.12.9, inference performance degraded dramatically on Apple Silicon M1. The system now uses 50% CPU instead of 100% GPU (Metal), making lyric generation and other LLM tasks unusably slow during production demos.
This worked perfectly in v0.12.5 - GPU-only inference with no CPU usage.
Relevant log output
Expected Behavior (v0.12.5)
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:32b ... 22 GB 100% GPU ...
- Processor: 100% GPU (Metal)
- Performance: Fast, production-ready
- CPU Usage: 0%
Actual Behavior (v0.12.9)
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:32b ... 22 GB 50% CPU ...
- Processor: 50% CPU (Metal acceleration appears broken)
- Performance: Extremely slow, unusable
- GPU Usage: Not utilized despite OLLAMA_DEVICE=metal
Configuration
LaunchAgent plist (~/Library/LaunchAgents/com.ollama.server.plist):
<dict>
<key>HOME</key>
<string>/Users/robertw</string>
<key>OLLAMA_HOST</key>
<string>0.0.0.0:11434</string>
<key>OLLAMA_ORIGINS</key>
<string>*</string>
<key>OLLAMA_KEEP_ALIVE</key>
<string>3600</string>
<key>OLLAMA_MAX_MEMORY</key>
<string>25GiB</string>
<key>OLLAMA_DEVICE</key>
<string>metal</string>
<key>PATH</key>
<string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>
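After editing the plist (this is presumably the EnvironmentVariables dict of the agent), the agent has to be reloaded before new values take effect. A minimal sketch, assuming the label and path given above:
launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist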
Reproduction Steps
1. Install Ollama v0.12.9 on Apple Silicon M4 MacBook Air (32 GB)
2. Configure environment variables as shown above
3. Start Ollama via LaunchAgent
4. Load large model: ollama run qwen2.5:32b
5. Run ollama ps → Shows 50% CPU instead of 100% GPU
6. Downgrade to v0.12.5
7. Restart Ollama
8. Run ollama ps → Shows 100% GPU (correct behavior)
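To quantify steps 5 and 8 beyond the ollama ps output, the per-version token throughput can be compared directly (a sketch using the same --verbose technique that appears later in this thread; assumes the model is already pulled):
ollama run --verbose qwen2.5:32b hello 2>&1 | grep "^eval rate"
A large drop in eval rate on the newer version is a clearer signal of CPU fallback than the CPU-usage percentage alone.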
Impact
- Production blocker: Lyric generation took minutes instead of seconds during customer demo
- Regression: v0.12.5 works perfectly with GPU-only inference
- Workaround: Forced to stay on v0.12.5
Related Issues
- #11888 (M2 Pro CPU instead of Metal GPU)
- #8623 (ollama ps shows 100% GPU but uses CPU)
- #10445 (VRAM full but CPU/RAM actually used)
Additional Context
The release notes for v0.12.9 mention "Fix performance regression on CPU-only systems" - it appears this fix may have introduced a regression for Apple Silicon unified memory systems by incorrectly triggering CPU fallback logic.
Question
Is this a known issue? Should I test with OLLAMA_LLM_LIBRARY=metal as a workaround, or is this a fundamental regression in the Metal backend for v0.12.9?
OS
macOS
GPU
Apple
CPU
Apple
Ollama version
0.12.9
Server log may help in debugging.
Please give 0.12.10 a try, as a metal related scheduling bug was recently fixed that may be the cause of what you're seeing.
I confirm that it's not working on Apple M1, even with version 0.12.10. I'm using it right now, same model, same inputs, and it's superslow in the latest version, unfortunately.
Server log may help in debugging.
time=2025-11-07T14:28:07.694+01:00 level=INFO source=server.go:653 msg="loading model" "model layers"=49 requested=1
@ComplexPlaneDev Have you set num_gpu for this model?
@rick-github there has never been any need for this before. Has anything changed? The only thing I did was upgrade Ollama; the model has never changed, and it has always run at a nice speed. All of a sudden it started running super slowly.
FYI, the model is this one: https://ollama.com/jobautomation/OpenEuroLLM-Italian
The parameters for this model explicitly limit the GPU layer count to 1, which accounts for the slowness. This should have been the case in previous ollama versions too. Since I don't have a Mac I can't speculate about the perceived slowness after an update. You can make this model run fully on the GPU by creating a copy without the restricting parameters:
% ollama show --modelfile jobautomation/OpenEuroLLM-Italian:latest | egrep -v "num_ctx|num_gpu" > Modelfile
% ollama create OpenEuroLLM-Italian
Testing:
% ollama run --verbose jobautomation/OpenEuroLLM-Italian:latest hello 2>&1 | grep "^eval rate"
eval rate: 7.65 tokens/s
% ollama run --verbose OpenEuroLLM-Italian:latest hello 2>&1 | grep "^eval rate"
eval rate: 53.00 tokens/s
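As a side note (not from the thread, just a sketch): instead of creating a copy, the layer count can also be overridden per request through the standard options field of the API, assuming the server is listening on the default port:
curl http://localhost:11434/api/generate -d '{
  "model": "jobautomation/OpenEuroLLM-Italian:latest",
  "prompt": "hello",
  "options": {"num_gpu": 99}
}'
Here 99 is simply a value larger than the model's 49 layers (per the log above), the intent being to offload everything.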
Could it be that ollama previously wasn't taking those parameters into account, at least in this context?
The code that processes the parameters is pretty device independent so I think it's unlikely, but I don't have access to a Mac so I can't test.
It could be that the client was overriding the value for num_gpu and that has changed. What client are you using?
I'm calling the model using LangChain JS, but the issue is occurring even with the plain Ollama UI interface. I'll test the parameter you mentioned. Thanks!
As an additional test, I went back to 0.12.5, as @rwellinger reported, and now it's fast again!
Even though on the command line I still get:
ollama show --modelfile jobautomation/OpenEuroLLM-Italian:latest | egrep "num_ctx|num_gpu"
PARAMETER num_ctx 2048
PARAMETER num_gpu 1
So nothing has changed apart from the Ollama version (0.12.10 => 0.12.5), and now the test done in the Ollama UI is super responsive. So there's definitely a regression somewhere, I just don't know where.
Maybe if I find a way to build it locally, I could git bisect this.
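For reference, a bisect between the two tags would look roughly like this (a sketch, assuming a local checkout and that go run . serve is enough to exercise each revision):
git clone https://github.com/ollama/ollama.git && cd ollama
git bisect start
git bisect bad v0.12.10
git bisect good v0.12.5
# at each step: go run . serve, re-run the model, then mark the revision
git bisect good    # or: git bisect bad
git bisect reset   # when finished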
Post the server log.
0.12.8 also works fine.
Considering that these are the changes, maybe this could help spot it: https://github.com/ollama/ollama/compare/v0.12.8...v0.12.9
Ok, sorry, I was misled by the issue title.
I confirm that v0.12.9 still works, it's version v0.12.10 that is broken. So it must be one of those:
https://github.com/ollama/ollama/compare/v0.12.9...v0.12.10
Could it be https://github.com/ollama/ollama/commit/6aa72830763cf694da998f5305de89701c75cea0 ?? @dhiltgen
From the logs, Flash Attention is enabled in 0.12.10 but not 0.12.5. Try setting OLLAMA_FLASH_ATTENTION=0 in the server environment.
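For a manually started server that would be, roughly (with the LaunchAgent setup above, the key would go into the plist instead):
OLLAMA_FLASH_ATTENTION=0 ollama serve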
The env var does not make any difference
Log?
Yes, I confirm, it's https://github.com/ollama/ollama/commit/6aa72830763cf694da998f5305de89701c75cea0 !
git clone https://github.com/ollama/ollama.git
cd ollama
git checkout v0.12.10
git revert 6aa72830763cf694da998f5305de89701c75cea0
go run . serve
go run . run jobautomation/OpenEuroLLM-Italian:latest
and it works nicely, super fast.
No, wait, hold on. I've cleared everything, checked out v0.12.10 with nothing reverted, and I'm in this state:
commit 80d34260ea16e76c9ef0d014a86cc130421855f1 (HEAD, tag: v0.12.10-rc1, tag: v0.12.10)
Author: Daniel Hiltgen [email protected]
Date:   Wed Nov 5 12:33:01 2025 -0800

    ci: re-enable signing (#12974)

commit 1ca608bcd155c771d0fed683a75d8367fe9c7144 (tag: v0.12.10-rc0)
Author: nicole pardal [email protected]
Date:   Wed Nov 5 11:58:03 2025 -0800

    embeddings: added embedding command for cl (#12795)

    Co-authored-by: A-Akhil <[email protected]>

    This PR introduces a new ollama embed command that allows users to generate embeddings directly from the command line.

    Added ollama embed MODEL [TEXT...] command for generating text embeddings
    Supports both direct text arguments and stdin piping for scripted workflows
    Outputs embeddings as JSON arrays (one per line)
Did a go run . serve, then a go run . run mymodel, and it's fast.
Now I'm really puzzled. Was the wrong binary packaged? I have no explanation.
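One way to narrow that down (just a sketch, and /usr/local/bin/ollama is an assumption about where the release binary lives) would be to compare what the packaged binary and the source tree report:
/usr/local/bin/ollama -v     # installed release binary
go run . -v                  # local source build (typically reports a dev version)
git describe --tags          # commit the working tree is actually on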
Thanks for the analysis. I'm not that deep into ollama myself. Just very thankful that you guys make it happen.
I'm uploading the server.log from the binary I built at the v0.12.10 commit hash, and it's working fine.
Looks like the only way to reproduce this is by installing the official 0.12.10 release package, so I have no idea how to reproduce it from source. I hope the log can shed some light here.
@ComplexPlaneDev can you set OLLAMA_DEBUG=2 and share an updated server log with 0.12.10 or 0.12.11?
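If the LaunchAgent makes capturing output awkward, one way to do this (a sketch; stop the agent first so the ports don't clash) is to run the server in a terminal and tee the log:
OLLAMA_DEBUG=2 ollama serve 2>&1 | tee server.log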