
feat: suggest / apply recommended llama.cpp settings based on computer specs

Open mageeagle opened this issue 4 months ago • 15 comments

Jan version

0.6.0

Describe the Bug

I tried loading the Gemma 3 models after the update and noticed much slower token generation speed; it uses the CPU instead of the GPU. Other models load fine. This happens on both the 4B and 12B versions; I didn't try the other sizes. GPU Layers is set to 100.

Steps to Reproduce

Download a Gemma 3 model, load it, and ask it to generate a test message.

Screenshots / Logs

Image

What is your OS?

  • [ ] macOS
  • [x] Windows
  • [ ] Linux

mageeagle avatar Jun 19 '25 16:06 mageeagle

I agree that text generation is much slower compared to the speeds I was getting before. I also noticed that "System Monitor" was moved into the settings, but it throws an error when I try to open it, so I can't see what's bottlenecking.

Image

yukkidev avatar Jun 19 '25 18:06 yukkidev

hi @mageeagle can you please help check if GPU is enabled here?

Image

david-menloai avatar Jun 20 '25 05:06 david-menloai

hi @yukkidev may I know if you're using the latest version, v0.6.1?

david-menloai avatar Jun 20 '25 05:06 david-menloai

> hi @mageeagle can you please help check if GPU is enabled here? Image

Yes it is; as stated, all other models are fine.

Image

mageeagle avatar Jun 20 '25 09:06 mageeagle

@qnixsynapse @louis-menlo do you have any idea? It sounds like a model problem.

david-menloai avatar Jun 20 '25 14:06 david-menloai

Hi @mageeagle, can you share the cortex.log file from the app data folder? We will take a look.

louis-jan avatar Jun 20 '25 14:06 louis-jan

cortex.log

Here you go. I copied only the logs from the session with Gemma. Thanks.

mageeagle avatar Jun 20 '25 18:06 mageeagle

Hi @mageeagle, can you go to Settings > Providers > Llama.cpp, then disable both Flash Attention and Caching, run the model again, and share the results from before and after?

louis-jan avatar Jun 22 '25 05:06 louis-jan
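
(Editor's note: for readers reproducing this outside Jan's UI, here is a minimal sketch of what these two toggles are assumed to correspond to at the llama.cpp level. The mapping of Jan's "Flash Attention" toggle to llama-server's -fa/--flash-attn flag, and of "Caching" to the server's per-request cache_prompt field, is an assumption; the model filename is hypothetical.)

```python
# Sketch, not Jan's actual implementation: run llama-server with flash
# attention left off, then send a completion request with prompt caching
# disabled. Flag-to-toggle mapping is an assumption (see note above).
import subprocess
import requests

server = subprocess.Popen([
    "llama-server",
    "-m", "gemma-3-4b-it-Q4_K_M.gguf",  # hypothetical model file
    "-ngl", "100",                       # offload all layers, as in the report
    # no "-fa" / "--flash-attn" here, so flash attention stays disabled
])

# (in real use, wait for the server to finish loading the model first)
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Generate a test message.",
        "n_predict": 64,
        "cache_prompt": False,  # disable prompt caching for this request
    },
)
print(resp.json()["content"])
```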

Here's the one for default settings: cortex_default.log

Here's the one with Flash Attention and Caching disabled: cortex_disabled_flash_cache.log

With flash attention and caching disabled, the model outputs around 40 tokens/s, at least twice as fast as with the default settings, and I no longer see significant CPU activity.

mageeagle avatar Jun 22 '25 08:06 mageeagle

Awesome, thanks @mageeagle. Please note that with these settings disabled, some hardware configurations and large conversations may not work well.

I'll convert this into a feat: request where the app detects your hardware and suggests recommended settings. cc @qnixsynapse

louis-jan avatar Jun 22 '25 09:06 louis-jan
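
(Editor's note: as a rough illustration of what such a feature could look like. Everything here — the GpuInfo fields, setting names, and thresholds — is hypothetical, not Jan's actual code.)

```python
# Hypothetical sketch: probe the GPU and suggest llama.cpp settings.
# The real app would query whatever backend it ships with.
from dataclasses import dataclass

@dataclass
class GpuInfo:
    vendor: str               # e.g. "nvidia", "amd", "apple", "none"
    vram_mb: int
    supports_flash_attn: bool

def suggest_settings(gpu: GpuInfo, model_size_mb: int) -> dict:
    settings = {
        # Only enable flash attention when the backend supports it;
        # otherwise attention work falls back to CPU (the bug in this thread).
        "flash_attention": gpu.supports_flash_attn,
        "caching": gpu.supports_flash_attn,
    }
    if gpu.vendor == "none" or gpu.vram_mb == 0:
        settings["gpu_layers"] = 0            # CPU-only machine
    elif model_size_mb <= gpu.vram_mb:
        settings["gpu_layers"] = 100          # model fits: fully offload
    else:
        # Model too big for VRAM: offload proportionally to what fits.
        settings["gpu_layers"] = max(1, 100 * gpu.vram_mb // model_size_mb)
    return settings
```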

I tried the same test (default vs. caching off) on another model (Codestral 22B) and there is no difference in token generation speed (both at 27 tokens/s). Just checking: is it really not a model issue?

mageeagle avatar Jun 22 '25 09:06 mageeagle

@mageeagle, it's better to test on smaller models that can be fully offloaded to the GPU. Very large models that can't fit into VRAM partially offload to the CPU, making the difference less noticeable.

louis-jan avatar Jun 22 '25 10:06 louis-jan
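
(Editor's note: a back-of-envelope check of that fit, with assumed numbers — Q4-class quantization at roughly 4.5 bits per weight; actual file sizes vary by quant.)

```python
# Will the model's weights fit in VRAM? Rough arithmetic only; ignores
# the KV cache beyond a fixed headroom factor. All numbers are assumed.
def fits_in_vram(params_b: float, bits_per_weight: float, vram_gb: float) -> bool:
    weights_gb = params_b * bits_per_weight / 8  # GB for weights alone
    return weights_gb < vram_gb * 0.9            # leave headroom for KV cache

# e.g. a 22B model at ~4.5 bits needs ~12.4 GB just for weights,
# so it cannot fully offload on a 12 GB card:
print(fits_in_vram(22, 4.5, 12))  # False
```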

I just did some testing with 3 smaller models; only Gemma 3 shows a difference with the settings change. Logs for your reference:

Gemma 3 4B: the default settings output 40 tokens/s, and speed quickly degrades to 20 tokens/s; CPU load rises to 70% during output. Default: cortex_default-gemma4b.log

No Cache: with flash attention and caching disabled, it outputs 90 tokens/s with no significant CPU usage increase. cortex_disabled-gemma4b.log

Qwen2.5 Coder 7B: no significant difference in speed, no significant CPU usage increase. Default: cortex_default-quen-7b.log

No Cache: cortex_disabled-quen-7b.log

Qwen2.5 Coder 14B: no significant difference in speed, no significant CPU usage increase. Default: cortex_default-quen-14b.log

No Cache: cortex_disabled-quen-14b.log

mageeagle avatar Jun 22 '25 10:06 mageeagle

Please note: there is no need to disable prompt caching. The culprit here is flash attention. If it is not supported by the backend, the attention layers are offloaded to the CPU for prompt processing.

qnixsynapse avatar Jun 22 '25 10:06 qnixsynapse
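
(Editor's note: a conceptual sketch of the fallback just described — not actual llama.cpp code.)

```python
# When flash attention is requested but the backend has no kernel for it,
# attention work runs on the CPU even though the weights sit on the GPU,
# which matches the high CPU load and slow generation reported above.
def attention_device(flash_attn_enabled: bool, backend_has_flash_attn: bool) -> str:
    if flash_attn_enabled and not backend_has_flash_attn:
        return "cpu"  # fallback path: slow prompt processing, high CPU load
    return "gpu"
```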

Noted, though the testing was also done with flash attention both enabled and disabled, so it should be a valid comparison.

mageeagle avatar Jun 22 '25 10:06 mageeagle