feat: suggest / apply recommended llama.cpp settings based on computer specs
Jan version
0.6.0
Describe the Bug
I tried loading the Gemma 3 models after the update and noticed that token generation is much slower, and that the CPU is being used instead of the GPU. I tried loading other models and they are fine. This happens on both the 4B and 12B versions; I didn't try the other sizes. GPU Layers is set to 100.
Steps to Reproduce
Download the Gemma 3 models, load one, and ask it to generate a test message.
Screenshots / Logs
What is your OS?
- [ ] MacOS
- [x] Windows
- [ ] Linux
I agree that text generation is much slower compared to the speeds I was getting beforehand. Also, I noticed that "System Monitor" was moved into the settings, but it throws an error when I try to open it, so I'm not able to see what's bottlenecking.
hi @mageeagle can you please help check if GPU is enabled here?
hi @yukkidev may I know if you're using the latest version v0.6.1?
> hi @mageeagle can you please help check if GPU is enabled here?

Yes, it is. As stated, the other models are all fine.
@qnixsynapse @louis-menlo do you have any idea? Sounds like it could be a model problem.
Hi @mageeagle can you help us by sharing the cortex.log file from the app data folder? We will take a look then.
Hi @mageeagle can you help me with this: go to Settings > Providers > Llama.cpp, then disable both Flash Attention and Caching, run the model again, and share the results from before and after the change.
Here's the one for default settings: cortex_default.log
Here's the one with Flash Attention and Caching disabled: cortex_disabled_flash_cache.log
With flash attention and caching disabled, the model outputs around 40 tokens/s, at least 2 times faster than the default settings, and I no longer see significant CPU activity.
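For reference, the same before/after comparison can be scripted outside the Jan UI. The sketch below assumes llama-cpp-python (which wraps the same llama.cpp engine Jan uses); the model filename, prompt, and token budget are placeholders rather than the reporter's actual setup, and Jan's Caching toggle isn't replicated here since it maps onto KV-cache options not spelled out in this thread.

```python
# Rough throughput check with flash attention on vs. off.
# Requires: pip install llama-cpp-python (built with GPU support).
import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, flash_attn: bool) -> float:
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=-1,        # offload all layers, i.e. "GPU Layers = 100" in Jan
        flash_attn=flash_attn,
        verbose=False,
    )
    start = time.perf_counter()
    out = llm("Write a short test message.", max_tokens=128)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

if __name__ == "__main__":
    model = "gemma-3-4b-it-Q4_K_M.gguf"  # placeholder filename
    for fa in (True, False):
        print(f"flash_attn={fa}: {tokens_per_second(model, fa):.1f} tok/s")
```

The timing includes prompt processing, so it's only a coarse comparison, but a gap as large as the one reported above should still show up clearly.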
Awesome, thanks @mageeagle. Please note that with these settings disabled, it may not work on some hardware or with large conversations.
I'll convert this into a feature request where the app detects your hardware and suggests recommended settings. cc @qnixsynapse
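As a starting point for discussion, here's a hypothetical sketch of what the detection side could look like. The NVML probe, the 20% headroom figure, and the returned setting names are all assumptions for illustration, not Jan's actual design.

```python
# Hypothetical hardware probe: check free VRAM and suggest llama.cpp settings.
# Requires: pip install nvidia-ml-py (NVIDIA only; other vendors need their own probe).
import pynvml

def suggest_settings(model_size_bytes: int) -> dict:
    try:
        pynvml.nvmlInit()
        free_vram = pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetHandleByIndex(0)
        ).free
        pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        # No usable NVIDIA GPU detected: run on CPU and skip GPU-only features.
        return {"n_gpu_layers": 0, "flash_attn": False, "caching": False}

    # Leave ~20% headroom for the KV cache and runtime overhead.
    fits_in_vram = model_size_bytes * 1.2 < free_vram
    return {
        "n_gpu_layers": -1 if fits_in_vram else 0,  # -1 = offload everything
        # Only suggest flash attention when the model sits fully on the GPU,
        # since an unsupported backend falls back to slow CPU attention.
        "flash_attn": fits_in_vram,
        "caching": fits_in_vram,
    }

if __name__ == "__main__":
    print(suggest_settings(model_size_bytes=3 * 1024**3))  # e.g. a ~3 GB GGUF
```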
Though when I tried the same test (default vs. caching off) on another model (Codestral 22B), there was no difference in token generation speed (both at 27 tokens/s). Just checking: is it really not a model issue?
@mageeagle, it's better to test on smaller models that can be fully offloaded to the GPU. Very large models that can't fit into VRAM are partially offloaded to the CPU, which makes the difference less noticeable.
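To make that concrete, here is a very rough back-of-envelope estimate. It assumes roughly 0.6 bytes per parameter for a Q4_K_M quant and, purely for illustration, a 12 GB card; KV cache and runtime overhead are ignored, and none of these figures come from the reporter's setup.

```python
# Very rough size estimate: a Q4_K_M GGUF is on the order of 0.6 bytes per parameter.
# The 12 GB figure is an arbitrary example card, not the reporter's actual GPU.
def approx_weights_gb(params_billion: float, bytes_per_param: float = 0.6) -> float:
    return params_billion * bytes_per_param

for name, params in [("Gemma 3 4B", 4), ("Qwen2.5 Coder 14B", 14), ("Codestral 22B", 22)]:
    size = approx_weights_gb(params)
    verdict = "fits" if size < 12 else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights -> {verdict} in 12 GB of VRAM (KV cache not counted)")
```

On a card in that range, the 22B model would already spill onto the CPU regardless of the flash attention setting, which is consistent with Codestral showing no difference in the tests below.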
I just did some testing with 3 smaller models; only Gemma 3 shows a difference with the settings change. Logs for your reference:

Gemma 3 4B: the default settings output 40 tokens/s, degrading quickly to 20 tokens/s, with CPU load rising to 70% during output. With Flash Attention and Caching disabled, it outputs 90 tokens/s with no significant CPU usage increase.
- Default: cortex_default-gemma4b.log
- Disabled: cortex_disabled-gemma4b.log

Qwen2.5 Coder 7B: no significant difference in speed, no significant CPU usage increase.
- Default: cortex_default-quen-7b.log
- Disabled: cortex_disabled-quen-7b.log

Qwen2.5 Coder 14B: no significant difference in speed, no significant CPU usage increase.
- Default: cortex_default-quen-14b.log
- Disabled: cortex_disabled-quen-14b.log
Please note: there's no need to disable prompt caching. The culprit here is flash attention; if it's not supported by the backend, the attention layers are offloaded to the CPU for prompt processing.
Noted, though the testing was also done with flash attention both enabled and disabled, so it should be a valid comparison.
