🚨FAQs | 常见问题🚨
> [!NOTE]
> Please avoid creating issues regarding the following questions, as they might be closed without a response.
> 请避免创建与下述问题有关的 issues,这些 issues 可能不会被回复。
> [!TIP]
> Documentation: https://kvcache-ai.github.io/ktransformers/
> 中文文档: https://github.com/kvcache-ai/ktransformers/tree/main/doc/zh
Most common problems / 大多数问题
How to install kt-kernel / 怎么安装 kt-kernel
Please update the repository and reinstall following the guide below, which also covers common issues: 请参考文档进行安装以及常见问题指导: https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md
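A minimal sketch of the typical flow (an illustration only, not the authoritative steps; it assumes a pip-based install from the repository, and the exact prerequisites and build flags are described in the README linked above):

```bash
# Hedged sketch -- follow kt-kernel/README.md for the authoritative steps
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers/kt-kernel
pip install .
```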
Inference with kt-kernel + SGLang / 在 kt-kernel + SGLang 上推理
1. Install SGLang
```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
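A quick way to confirm the installation succeeded (an optional sanity check, not part of the official instructions):

```bash
python -c "import sglang; print(sglang.__version__)"
```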
2. Prepare Weights
You need both GPU weights and CPU weights for heterogeneous inference:
- GPU weights: use the original or quantized model weights.
- CPU weights: quantize to the AMX-optimized format using the conversion script:
```bash
# --input-type depends on your GPU weight type: fp8, fp16, or bf16
# --quant-method can be int8 or int4
python scripts/convert_cpu_weights.py \
    --input-path /path/to/model \
    --input-type bf16 \
    --output /path/to/cpu-weights \
    --quant-method int8
```
Supported input formats: FP8, FP16, BF16 → INT4/INT8.
For more details, see the kt-kernel README: https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md
Note: LLAMAFILE backend supports GGUF format directly, but this feature is still in preview.
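For the GGUF path, the launch flags described in step 3 below would look roughly like the following (a hypothetical sketch: it assumes the GGUF weights directory is passed via --kt-weight-path in the same way as converted weights, which may differ in practice):

```bash
python -m sglang.launch_server \
    [your normal SGLang parameters...] \
    --kt-method LLAMAFILE \
    --kt-weight-path /path/to/gguf-weights
```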
3. Launch SGLang Server
Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:
KT-Kernel Parameters to Add:
- `--kt-method`: backend method (AMXINT4, AMXINT8, or LLAMAFILE)
- `--kt-weight-path`: path to the converted CPU weights
- `--kt-cpuinfer`: number of CPU inference threads (set to the number of physical cores)
- `--kt-threadpool-count`: number of thread pools (set to the NUMA node count)
- `--kt-num-gpu-experts`: number of experts to keep on GPU
- `--kt-max-deferred-experts-per-token`: deferred experts for pipelined execution
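To pick values for `--kt-cpuinfer` and `--kt-threadpool-count`, you can inspect the physical core count and NUMA topology first (a Linux sketch; `numactl` may need to be installed separately):

```bash
# Physical cores = Socket(s) x Core(s) per socket
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|NUMA node\(s\)'
# NUMA layout, e.g. "available: 2 nodes (0-1)"
numactl --hardware | head -n 1
```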
Example:
```bash
python -m sglang.launch_server \
    [your normal SGLang parameters...] \
    --kt-method AMXINT8 \
    --kt-weight-path /path/to/cpu-weights \
    --kt-cpuinfer 64 \
    --kt-threadpool-count 2 \
    --kt-num-gpu-experts 32 \
    --kt-max-deferred-experts-per-token 2
```
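Once the server is up, you can sanity-check it with an OpenAI-compatible request (assuming SGLang's default port 30000; adjust the host, port, and model name to your deployment):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```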
Supported models / 支持的模型列表
| Model name / 模型名称 | Tested / 是否测试过 |
|---|---|
| GLM-4.5-Air | :white_check_mark: |
| GLM-4.5 | :white_check_mark: |
| Qwen3-30B-A3B | :white_check_mark: |
| Qwen3-235B-A22B-Thinking-2507 | :white_check_mark: |
| Qwen3-235B-A22B-Instruct-2507 | :white_check_mark: |
| Qwen3-Next-80B-A3B-Thinking | :white_check_mark: |
| DeepSeek-R1-0528 | :white_check_mark: |
> [!NOTE]
> In principle, we can support any model that SGLang supports. If a model cannot run, then SGLang running purely on GPU likely won't be able to run it either. So if you test additional models that work, feel free to share them.
> [!TIP]
> If the problem still exists with the latest code, please create an issue.
> 若使用最新的代码仍然无法解决问题,请创建一个 issue。
We still need a tracking matrix of model support, for example the current status of the GLM-series models: which weight formats are supported and which are not. You are welcome to report additional models that work with kt-kernel, along with a snapshot as evidence.