Tufan Demir
  Additionally, there is a significant slowdown in inference speed. I was getting 148 tokens per second for Gemma 2 9B Q4 on an RTX 5090; now the...
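For anyone trying to reproduce the numbers: the CLI prints an eval rate when run with `--verbose`. A minimal sketch, assuming the `gemma2:9b` Q4 tag from the library:

```
ollama run gemma2:9b --verbose "Write a haiku about GPUs."
# The stats printed after the response include a line like:
#   eval rate:            148.xx tokens/s
```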
To avoid confusion, I deleted the previous log. The log from my first post belongs to version 0.5.13-rc1, while the one I just sent corresponds to version 0.5.12. 0.5.12 ...
To be sure, I reinstalled version 0.5.13-rc1 and repeated the test.

```
zemin@ai-server:~$ journalctl -fu ollama
Feb 28 15:55:24 ai-server ollama[1170322]: CUDA driver version: 12.8
Feb 28 15:55:24 ai-server...
```
### Observations from Logs

1. **Graph Splits**:
   - 0.5.12: `graph splits = 2`
   - 0.5.13-rc1: `graph splits = 86`
2. **Compute Buffer Size**:
   - 0.5.12: `CUDA_Host compute buffer size =...`
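If anyone wants to pull the same lines from their own logs for comparison, something like this should work (assuming the default `ollama` systemd unit created by the install script):

```
journalctl -u ollama --no-pager | grep -E "graph splits|compute buffer size"
```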
Additionally, I'm not sure if it makes a difference, but I didn't compile it from source. I installed it using the following command:

```
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=v0.5.13-rc1 sh
```
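To rule out a stale binary, the installed version can be double-checked like this (`ollama --version` also warns when the client and server versions differ):

```
ollama --version
```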
The same issue is present on Windows 11 as well (RTX 4070 Ti SUPER, Ryzen 9 5900X).

[screenshots: 0.5.12 vs. 0.5.13-rc1 results]
Thanks for the update! Setting `OLLAMA_FLASH_ATTENTION=0` restored the speed to normal. What is the reason for this? Should we keep flash attention disabled in this release, unlike in previous versions?
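For others on Linux who installed via the script, one way to set the variable persistently for the systemd service (a sketch, assuming the default `ollama` unit name):

```
# Open an override file for the service:
sudo systemctl edit ollama
# Add the following to the override, then save:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
```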
Other models with the same parameter count run significantly faster than Gemma 3, and even models with a **larger** parameter count outperform it. Related issue: [GitHub Issue #9701](https://github.com/ollama/ollama/issues/9701)