Chen Hongtao


@PhzCode Could you provide the following information: 1. Which CPU backend was used in the failing case, AMX or llamafile? 2. What was the prompt of the request that triggered the error, or how many tokens long was it?

The issue is that you used `--kt-num-gpu-experts 16`, which keeps 16 experts per layer on the GPU. 24 GB of VRAM can't hold that many, so try lowering this parameter.

I see. The native Kimi-K2-Thinking model keeps its non-expert weights in BF16 on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn't sufficient. You may consider...

We provide the [scripts](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/scripts) to perform quantization, but by default they quantize the expert weights and leave the non-expert weights untouched, whereas in your case only the non-expert weights need quantizing. I guess...

This part of the code has not yet been merged into the main branch; the merge is in progress. In the meantime, you can refer to the [sosp25-ae branch](https://github.com/kvcache-ai/ktransformers/tree/sosp25-ae/sosp25-ae).

KTransformers has been refactored, and the YAML-based flexible injection framework is now deprecated. The inference code now lives in [kt-kernel](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel), and the recommended way to run it is through SGLang. When launching SGLang...