bug: Jan v0.72 → v0.73 Regression — Local API Server Fails to Start + llama.cpp Model Fails to Initialize (Apple M4)
Summary
Upgrading Jan from v0.72 to v0.73 introduced a regression where:
1. Local API server fails to start, immediately shutting down.
2. llama.cpp (Metal backend) fails to initialize models due to a mismatch between Flash Attention and quantized KV cache settings.
3. Identical models and settings that function correctly in v0.72 fail in v0.73.
This completely blocks local model execution on Apple Silicon (M4).
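For clarity, here is a reconstruction of the effective engine settings implied by the logs below. The field names are illustrative only, not Jan's actual configuration schema:

```ts
// Illustrative reconstruction of the effective llama.cpp engine settings
// implied by the logs. Field names are hypothetical, not Jan's real schema.
interface EngineSettings {
  flashAttention: "auto" | "on" | "off";
  cacheTypeK: "f16" | "q8_0" | "q4_0";
  cacheTypeV: "f16" | "q8_0" | "q4_0";
}

// Jan v0.72 (works): Flash Attention auto-disables, KV cache stays at f16.
const v072: EngineSettings = {
  flashAttention: "auto",
  cacheTypeK: "f16",
  cacheTypeV: "f16",
};

// Jan v0.73 (fails): a quantized V cache is requested while Flash Attention
// still resolves to "off", which llama.cpp rejects at context creation.
const v073: EngineSettings = {
  flashAttention: "auto", // resolves to "off" on this model/backend
  cacheTypeK: "q8_0",     // assumption based on the error message
  cacheTypeV: "q8_0",
};
```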
⸻
System
• Hardware: Apple M4
• OS: macOS Tahoe 26.1
• Jan Version: 0.73
• llama.cpp Backend Build: B6929 / macos-arm64
• Backend: Metal (llama.cpp)
• Previously working version: Jan 0.72
• Model: Jan-v1-4B-Q4_K_M (Qwen3 architecture)

⸻
Expected Behavior
Under v0.72:
• Local API server starts and stays online
• Metal backend initializes normally
• Model loads successfully using default llama.cpp engine settings
• Flash Attention auto-disables cleanly when unsupported
• KV cache remains in f16 mode (no quantization)
• Context initializes without error
⸻
Actual Behavior
Under v0.73, two major failures occur:
⸻
1. Local API Server Does Not Start
The server attempts to bind:

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 3557
But immediately after the model load fails, the server tears itself down before fully starting:
srv load_model: failed to load model ...
srv operator(): cleaning up before exit...
main: exiting due to model loading error
Result:
• API server never becomes reachable
• Server shuts down instantly due to backend error
• This did not occur in Jan v0.72 (the server stayed up even when models failed)
⸻
2. llama.cpp Model Initialization Fails (Flash Attention + Quantized KV Cache Regression)
Full logs show Metal initializes correctly and GGUF metadata loads fully, but context creation fails:
llama_context: Flash Attention was auto, set to disabled
llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
Additional context:
llama_context: layer 0 assigned to Metal but Flash Attention tensor assigned to CPU
llama_context: Flash Attention auto → disabled
This indicates a regression:
• Jan v0.73 is incorrectly requesting a quantized KV cache (q8/q4)
• llama.cpp requires Flash Attention for a quantized KV cache
• Flash Attention is auto-disabled on this architecture
• → Context creation aborts
• → Server exits
• → Model cannot load
This exact model and configuration works in v0.72.
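The same constraint can be reproduced outside Jan by launching the bundled llama-server directly with a quantized V cache and no Flash Attention. The sketch below is illustrative: the model path is a placeholder, and the exact flag spellings should be double-checked against llama-server --help for build B6929.

```ts
// Illustrative repro outside Jan: spawn the bundled llama-server with a
// quantized V cache but without enabling Flash Attention. The model path is
// a placeholder; verify flag names against llama-server --help (build B6929).
import { spawn } from "node:child_process";

const args = [
  "-m", "/path/to/Jan-v1-4B-Q4_K_M.gguf", // placeholder path
  "--host", "127.0.0.1",
  "--port", "3557",
  "-ngl", "99",
  "--cache-type-k", "q8_0",
  "--cache-type-v", "q8_0", // quantized V cache requires Flash Attention
  // no Flash Attention flag, so it resolves to "disabled" on this setup
];

const server = spawn("llama-server", args, { stdio: "inherit" });
server.on("exit", (code) => {
  // Expected: non-zero exit with
  // "quantized V cache was requested, but this requires Flash Attention"
  console.log(`llama-server exited with code ${code}`);
});
```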
⸻
Key Log Excerpts
llama_context: Flash Attention was auto, set to disabled
llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
srv load_model: failed to load model ...
main: exiting due to model loading error
Metal initialization, GPU layer offloading, and KV cache allocation all succeed prior to failure.
⸻
Steps to Reproduce
1. Upgrade Jan from 0.72 → 0.73
2. Open a local model using the llama.cpp (Metal) backend
3. Use default engine settings (or import settings from v0.72)
4. Start the model
5. Observe:
• Local API server starts, then instantly shuts down (a quick reachability probe is sketched after this list)
• Model fails to initialize with the Flash Attention / KV cache error
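A small probe confirms the server never becomes reachable. The /health endpoint is llama-server's built-in health check; the port (3557) is taken from the logs above. Under v0.73 the request fails because the server has already exited:

```ts
// Quick probe: confirm the local API server never becomes reachable.
// The /health endpoint is llama-server's built-in health check; the port
// (3557) is taken from the logs above.
async function probeServer(): Promise<void> {
  try {
    const res = await fetch("http://127.0.0.1:3557/health");
    console.log("server reachable:", res.status, await res.text());
  } catch (err) {
    // Under v0.73 this branch is hit: the server has already torn itself down.
    console.log("server unreachable:", (err as Error).message);
  }
}

probeServer();
```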
⸻
Workarounds Attempted (None Effective)
• Setting --main-gpu 0
• Resetting all llama.cpp engine settings
• Disabling / enabling Flash Attention
• Reducing context size (32768 → 8192)
• Reducing n_gpu_layers
• Lowering n_parallel
• Changing KV Cache K/V types to f16
Jan v0.73 appears to ignore or override GUI KV Cache settings, still sending quantized KV to llama.cpp.
⸻
Impact
• All llama.cpp models fail on Apple Silicon M4
• Local API server cannot start
• Local model support is effectively unusable
• Behavior is a regression from v0.72
⸻
Suspected Root Cause
A change in Jan v0.73 (or its llama.cpp integration) is:
1. Forcing or auto-enabling a quantized KV cache
2. While Flash Attention is auto-disabled
3. Creating an invalid llama.cpp configuration:
• Quantized KV requires Flash Attention
• Flash Attention → disabled
• Result → hard failure
This also causes the local API server to exit prematurely.
⸻
Suggested Fixes
• Ensure Jan does not enable a quantized KV cache when Flash Attention is unavailable (a guard sketch follows this list)
• Correctly apply UI-based KV cache settings (f16, etc.)
• Add a version-safe migration to avoid importing incompatible config from v0.72
• Prevent server shutdown when model loading fails (as in v0.72)
• Improve error contextualization in the UI (Flash Attention mismatch)
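A minimal sketch of the guard the first suggestion implies (hypothetical types and function, not Jan's actual code):

```ts
// Hypothetical guard sketch for the first suggestion: never pass a quantized
// KV cache to llama.cpp unless Flash Attention is actually enabled.
type CacheType = "f16" | "q8_0" | "q4_0";

interface KvSettings {
  flashAttentionEnabled: boolean;
  cacheTypeK: CacheType;
  cacheTypeV: CacheType;
}

function sanitizeKvSettings(s: KvSettings): KvSettings {
  if (s.flashAttentionEnabled) return s;
  // llama.cpp rejects a quantized V cache without Flash Attention,
  // so fall back to full-precision KV instead of aborting the load.
  return { ...s, cacheTypeK: "f16", cacheTypeV: "f16" };
}
```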
⸻
Root Cause
Jan v0.73 passed an invalid configuration to llama.cpp: it requested a quantized KV V-cache (q8/q4) while Flash Attention was auto-disabled on Apple M4. Quantized KV requires Flash Attention, so llama.cpp aborted context initialization and the local API server never started.
⸻
Fix
Switching the llama.cpp engine settings back to full-precision KV cache (both K and V = f16) resolved the issue. Disabling “Quantized KV Cache” removed the Flash-Attention requirement, allowing the model to load and the API server to start normally.
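For reference, the working setup corresponds to passing full-precision cache types to llama-server; the flag spellings below are assumptions and should be checked against the bundled build:

```ts
// Working configuration after the fix: full-precision KV cache, so nothing
// requires Flash Attention anymore (it may still auto-disable harmlessly).
// Flag spellings are assumptions; check llama-server --help for build B6929.
const workingCacheArgs = [
  "--cache-type-k", "f16",
  "--cache-type-v", "f16",
];
console.log(workingCacheArgs.join(" "));
```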