Tufan Demir
  Additionally, there is a significant slowdown in inference speed. I was getting 148 tokens per second for Gemma 2 9B Q4 on an RTX 5090; now the...
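For anyone trying to reproduce the numbers: the CLI prints an eval rate when run with `--verbose`. A minimal sketch, assuming the `gemma2:9b` Q4 tag from the library:

```
ollama run gemma2:9b --verbose "Write a haiku about GPUs."
# The stats printed after the response include a line like:
#   eval rate:            148.xx tokens/s
```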
To avoid confusion, I deleted the previous log. The log from my first post belongs to version 0.5.13-rc1, while the one I just sent corresponds to version 0.5.12. 0.5.12 ...
To be sure, I reinstalled version 0.5.13-rc1 and repeated the test.

```
zemin@ai-server:~$ journalctl -fu ollama
Feb 28 15:55:24 ai-server ollama[1170322]: CUDA driver version: 12.8
Feb 28 15:55:24 ai-server...
```
### Observations from Logs

1. **Graph Splits**:
   - 0.5.12: `graph splits = 2`
   - 0.5.13-rc1: `graph splits = 86`
2. **Compute Buffer Size**:
   - 0.5.12: `CUDA_Host compute buffer size =...`
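If anyone wants to pull the same lines from their own logs for comparison, something like this should work (assuming the default `ollama` systemd unit created by the install script):

```
journalctl -u ollama --no-pager | grep -E "graph splits|compute buffer size"
```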
Additionally, I'm not sure if it makes a difference, but I didn't compile it from source. I installed it using the following command:

```
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=v0.5.13-rc1 sh
```
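To rule out a stale binary, the installed version can be double-checked like this (`ollama --version` also warns when the client and server versions differ):

```
ollama --version
```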
The same issue is present on Windows 11 as well (RTX 4070 Ti SUPER, Ryzen 9 5900X).

[screenshots: 0.5.12 vs. 0.5.13-rc1 results]
Thanks for the update! Setting `OLLAMA_FLASH_ATTENTION=0` restored the speed to normal. What is the reason for this? Should we keep flash attention disabled in this release, unlike in previous versions?
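For others on Linux who installed via the script, one way to set the variable persistently for the systemd service (a sketch, assuming the default `ollama` unit name):

```
# Open an override file for the service:
sudo systemctl edit ollama
# Add the following to the override, then save:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
```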
Other models with the same parameter count run significantly faster than Gemma 3, and even models with a **larger** parameter count outperform it. Related issue: [GitHub Issue #9701](https://github.com/ollama/ollama/issues/9701)