
bug: Request timeout/cancellation with slow inference models (1-4 t/s) [Critical]

Open TbilisiHeat opened this issue 3 months ago • 3 comments

Version: Jan.ai v0.6.9

Describe the Bug

Jan.ai consistently terminates long-form AI generation with a "Request Cancelled" error after approximately 120-200 seconds of inference, regardless of model performance or system resources. The timeout appears to be a hard-coded frontend limitation that prevents completion of extended reasoning tasks, mathematical problems, or creative writing that require sustained generation. The same models run perfectly in KoboldCpp (which uses the identical llama.cpp backend) without any timeout issues, suggesting this is a Jan.ai architectural problem rather than a model or hardware limitation.
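The symptom pattern (a clean cancellation at a roughly fixed wall-clock point while tokens are still streaming) is what a client-side abort timer on the streaming request would produce, rather than anything inside llama.cpp. A minimal sketch of that pattern follows; it is purely illustrative and not Jan's actual code, and the 120-second value and function names are assumptions:

```typescript
// Hypothetical sketch of a client-side deadline aborting a slow stream.
// Nothing here is Jan's real code; the 120_000 ms value and the names are
// assumptions used only to illustrate the suspected failure mode.
async function streamCompletion(url: string, body: unknown): Promise<void> {
  const controller = new AbortController();
  // A fixed deadline like this cancels any generation that outlives it,
  // even though the backend is still producing tokens at 1-4 t/s.
  const deadline = setTimeout(() => controller.abort(), 120_000);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal,
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      process.stdout.write(decoder.decode(value)); // partial output still arrives
    }
  } finally {
    clearTimeout(deadline);
  }
}
```

If a timer like this exists anywhere in the request path, it would need to be configurable, or reset on each received chunk, instead of bounding total request time.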

Steps to Reproduce

1. Load any large language model (tested with GLM-4.5-AIR IQ4_XS, ~60 GB).
2. Configure the model with standard settings (multiple batch size configurations tested).
3. Submit a complex prompt requiring 500+ tokens of response (philosophical reasoning, long-form creative writing, mathematical problems).
4. Observe that the model begins generating at normal speed (1-4 tokens/second).
5. After ~120-200 seconds of generation, Jan.ai displays a "Request Cancelled" error and terminates the response.
6. Attempt the same prompt in KoboldCpp with the identical model: it completes successfully without a timeout.

Screenshots / Logs

Attempted Configuration Fixes (All Failed):

  • Reduced batch_size from 2048 to 256
  • Reduced ubatch_size from 512 to 128
  • Set environment variables: LLAMA_SERVER_TIMEOUT=3600, LLAMA_REQUEST_TIMEOUT=3600
  • Modified model YAML and JSON configs with timeout parameters
  • Disabled dynamic batching
  • Reduced context size to 2048

Error Behavior:

  • Timeout occurs regardless of available system resources (70 GB+ free RAM)
  • Model continues generating normally until the sudden cancellation
  • No memory pressure, swap usage, or CPU throttling observed
  • Issue persists across different model sizes and quantization levels

Validation via Alternative Client:

  • Same model in KoboldCpp: generates 2000+ tokens without issues
  • Same hardware, same llama.cpp backend, same model files
  • Only difference is the Jan.ai vs KoboldCpp frontend (a direct check against the backend is sketched below)
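A further isolation step would be to stream from the local llama.cpp server directly with no client-side deadline, which removes the Jan.ai frontend from the picture entirely. A sketch of that check, where the port, path, and model id are assumptions about the local setup rather than Jan internals:

```typescript
// Sketch: stream from the local llama.cpp server with no abort signal, to
// confirm the backend itself generates past 200 s. The port (8080), the
// OpenAI-compatible path, and the model id are assumptions about the local
// setup, not details taken from Jan.
async function directBackendCheck(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "glm-4.5-air-iq4_xs", // placeholder model id
      stream: true,
      max_tokens: 2000,
      messages: [
        { role: "user", content: "Write a long essay on causality." },
      ],
    }),
    // No AbortController/signal: the request runs until the model finishes.
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    process.stdout.write(decoder.decode(value));
  }
}
```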

Operating System

  • Linux (Manjaro, primary testing)
  • Linux (LMDE 6 on AMD 4800H laptop, replication confirmed)

Hardware Tested:

  • Primary: AMD Threadripper 1950X, 128 GB DDR4-2133, AMD RX480 (GPU disabled)
  • Secondary: AMD 4800H laptop, 16 GB RAM (both iGPU and CPU-only modes affected)

Performance Context:

  • Models tested: GLM-4.5-AIR IQ4_XS, Anubis-Pro-105B, BigTiger-27B
  • Inference speeds: 1-4 tokens/second (hardware bandwidth limited)
  • All models exhibit identical timeout behavior in Jan.ai
  • All models work perfectly in KoboldCpp/raw llama.cpp
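For scale: at the observed 1-4 tokens/second, a 500-token response takes roughly 125-500 seconds of wall-clock generation, so any fixed deadline in the 120-200 second range will cut off most long responses on this class of hardware.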

Impact: This bug makes Jan.ai unsuitable for any use case requiring extended generation, including academic research, creative writing, complex reasoning tasks, or technical analysis. Users are forced to switch to alternative frontends to access full model capabilities.


Operating System

  • [ ] MacOS
  • [ ] Windows
  • [x] Linux

TbilisiHeat · Sep 04 '25 10:09