
🚨FAQs | 常见问题🚨


[!NOTE] Please avoid creating issues regarding the following questions, as they might be closed without a response. 请避免创建与下述问题有关的 issues,这些 issues 可能不会被回复。

[!TIP] Documentation: https://kvcache-ai.github.io/ktransformers/ 中文文档:https://github.com/kvcache-ai/ktransformers/tree/main/doc/zh


Most common problems / 大多数问题

How to install kt-kernel/ 怎么安装 kt-kernel

Please update the repository and install again by following the documentation below: 请参考文档进行安装,以及常见问题指导: https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md
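
For orientation only, a rough sketch of a source install is below. This is an assumption of a standard pip-based install; the authoritative steps, prerequisites, and build options are in the README linked above.

# Rough sketch only (assumption: pip-based source install); see the kt-kernel README for the authoritative steps
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers/kt-kernel
pip install .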

Inference with kt-kernel + SGLang / 在 kt-kernel + SGLang 上推理

1. Install SGLang
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
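
As a quick sanity check after installing (a suggested check, not an official step), confirm the package imports and prints its version:

# The import should succeed and print the installed SGLang version
python -c "import sglang; print(sglang.__version__)"
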
2. Prepare Weights

You need both GPU weights and CPU weights for heterogeneous inference:

GPU Weights: Use the original or quantized model weights.

CPU Weights: Quantize to AMX-optimized format using the conversion script:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \  # Depends on your GPU weights type: fp8, fp16, or bf16
  --output /path/to/cpu-weights \
  --quant-method int8  # or int4

Supported input formats: FP8, FP16, BF16 → INT4/INT8.

For more details, see the kt-kernel README: https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md

Note: LLAMAFILE backend supports GGUF format directly, but this feature is still in preview.
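
If you want to try that preview path, the launch would skip the conversion above and reuse the KT-Kernel flags described in step 3. The sketch below assumes that --kt-weight-path can point at the GGUF weights directly when --kt-method is LLAMAFILE; the path is a placeholder.

# Preview sketch: GGUF weights with the LLAMAFILE backend, no AMX conversion step
python -m sglang.launch_server \
  [your normal SGLang parameters...] \
  [the remaining KT-Kernel flags from step 3...] \
  --kt-method LLAMAFILE \
  --kt-weight-path /path/to/gguf-weights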

3. Launch SGLang Server

Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:

KT-Kernel Parameters to Add:

  • --kt-method: Backend method (AMXINT4, AMXINT8, or LLAMAFILE)
  • --kt-weight-path: Path to the converted CPU weights
  • --kt-cpuinfer: Number of CPU inference threads (set to physical cores)
  • --kt-threadpool-count: Number of thread pools (set to NUMA node count)
  • --kt-num-gpu-experts: Number of experts to keep on GPU
  • --kt-max-deferred-experts-per-token: Deferred experts for pipelined execution
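
To pick values for --kt-cpuinfer and --kt-threadpool-count on Linux, you can read the physical-core and NUMA-node counts directly from the machine. This is just a small helper, not part of KT-Kernel; the mapping follows the parameter descriptions above.

# Physical cores = Socket(s) x Core(s) per socket -> --kt-cpuinfer
# NUMA node count -> --kt-threadpool-count
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|NUMA node\(s\):'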

Example:

python -m sglang.launch_server \
  [your normal SGLang parameters...] \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2
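
Once the server is up, you can verify it responds via the OpenAI-compatible endpoint that SGLang serves. The port and model name below are placeholders; match them to your launch command.

# Assumes the default SGLang port (30000); adjust host/port and model name to your setup
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}]}'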

Supported models / 支持的模型列表

| Model name / 模型名称 | Tested / 是否测试过 |
| --- | --- |
| GLM-4.5-Air | :white_check_mark: |
| GLM-4.5 | :white_check_mark: |
| Qwen3-30B-A3B | :white_check_mark: |
| Qwen3-235B-A22B-Thinking-2507 | :white_check_mark: |
| Qwen3-235B-A22B-Instruct-2507 | :white_check_mark: |
| Qwen3-Next-80B-A3B-Thinking | :white_check_mark: |
| DeepSeek-R1-0528 | :white_check_mark: |

[!NOTE] In principle, we can support any model that SGLang supports. If a model cannot run, then SGLang running purely on GPU likely won’t be able to either. So if you test additional models that work, feel free to share them.


[!TIP] If the problems still exist with the latest code, please create an issue. 若使用最新的代码仍然无法解决问题,请创建一个 issue。

KMSorSMS, Nov 13 '25 11:11

We need a tracking matrix of model support. For example, the current status of the GLM-series models: which weight formats are supported and which are not.

james0zan, Nov 13 '25 13:11

You are welcome to report new models that work with kt-kernel, along with a snapshot as evidence.

KMSorSMS, Nov 13 '25 13:11