[Bug] New Update breaks Inference for Balance Serve with Cache (gpu_only: false)
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
Hi, it looks like the latest version of KTransformers (the SmallThinker and GLM4-MoE update) breaks the kvc2 cache. After a successful install, I set the kvc2 config to:
```yaml
kvc2:
  gpu_only: true
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500
  disk_path: /mnt/data/persist-kvs
```
Every model I load and run inference on then produces this error in rpc.log:
```
[2025-07-30 19:35:15.530] [info] [prefix.cpp:1303] No Match, No need to load
[2025-07-30 19:35:15.530] [error] [prefix.cpp:1336] GPU Cache Layer Count not match
python3: /mnt/home_extend/llm/ktransglm/ktransformers/csrc/balance_serve/kvc2/src/prefix.cpp:1337: virtual void kvc2::KVC2::lookup_to_gpu_async(ModelName, QuantType, kvc2::Token*, kvc2::TokenLength, kvc2::TokenLength, std::function<void(std::shared_ptr<kvc2::DoubleCacheHandleInterface>)>): Assertion `false' failed.
```
With some debug prints, it looks like `h->k_info().hidden_layer_count()` on line 1335 of prefix.cpp does not return the model's layer count correctly; in my tests it returns 1.
Any ideas what it could be?
Reproduction
Environment
W7-3455 512GB DDR5 RTX 4090 ( CUDA 12.8 )
Are you trying to run long context with ktransformers, which is based on an old version of flashinfer (itself very buggy)? If so, skip it and go straight to ik_llama.cpp.