[Feature] KTransformers Integration to Support CPU/GPU Hybrid Inference for MoE Models
Motivation
While hybrid CPU/GPU inference alleviates memory constraints by leveraging CPU DRAM alongside GPU VRAM bandwidth, achieving high throughput remains challenging due to synchronization overheads and limited CPU compute efficiency. This PR upstreams the KTransformers approach (SOSP '25), enabling GPU Tensor Parallelism + CPU/GPU Hybrid Expert Parallelism for MoE models, supporting hybrid prefill and decode that combine kt-kernel's AMX-optimized CPU kernels with GPUs. With this design, dense layers benefit from high-throughput multi-GPU execution, while experts are flexibly scheduled across both CPUs and GPUs, maximizing hardware utilization and reducing bottlenecks. KTransformers will be incorporated into SGLang as a library backend. Building on this backend, SGLang generalizes the design to support multi-GPU tensor parallelism and CPU/GPU hybrid expert parallelism, while broadening coverage to additional models and weight formats.
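To make the hybrid expert-parallel idea concrete, here is a minimal conceptual sketch in PyTorch, not the actual KTransformers/SGLang code and with purely illustrative names, of one MoE layer whose experts are split between GPU and CPU: each token's top-k experts are dispatched to whichever device holds them, and in the real backend the CPU path runs kt-kernel's AMX-optimized kernels rather than a plain `torch.nn.Linear` (a CUDA device is assumed).

```python
import torch

class HybridMoELayer(torch.nn.Module):
    """Toy MoE layer with experts split across GPU ("hot" experts) and CPU."""

    def __init__(self, num_experts: int, hidden: int, gpu_expert_ids):
        super().__init__()
        self.gpu_expert_ids = set(gpu_expert_ids)
        self.experts = torch.nn.ModuleList()
        for eid in range(num_experts):
            device = "cuda" if eid in self.gpu_expert_ids else "cpu"
            # Stand-in expert; the real backend uses fused GPU MoE / AMX CPU kernels.
            self.experts.append(torch.nn.Linear(hidden, hidden).to(device))

    def forward(self, hidden_states, topk_ids, topk_weights):
        # hidden_states: [tokens, hidden] on the GPU
        # topk_ids, topk_weights: [tokens, k] routing results from the gate
        out = torch.zeros_like(hidden_states)
        for eid in topk_ids.unique().tolist():
            token_idx, slot_idx = (topk_ids == eid).nonzero(as_tuple=True)
            x = hidden_states[token_idx]
            if eid in self.gpu_expert_ids:
                y = self.experts[eid](x)                       # GPU expert
            else:
                y = self.experts[eid](x.cpu()).to(out.device)  # CPU expert
            out.index_add_(0, token_idx,
                           topk_weights[token_idx, slot_idx].unsqueeze(-1) * y)
        return out

# Usage sketch: 8 experts, experts {0, 1} pinned to the GPU, top-2 routing.
layer = HybridMoELayer(num_experts=8, hidden=16, gpu_expert_ids=[0, 1])
h = torch.randn(4, 16, device="cuda")
ids = torch.randint(0, 8, (4, 2), device="cuda")
w = torch.softmax(torch.randn(4, 2, device="cuda"), dim=-1)
print(layer(h, ids, w).shape)  # torch.Size([4, 16])
```

In practice the CPU and GPU expert paths run concurrently and their results are combined asynchronously; this sketch serializes them only for readability.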
Benchmark Results (Preview)
These are preliminary results for KTransformers with a single GPU, reflecting the performance of this feature in a single-card setting. More detailed benchmark data will be provided in follow-up updates. The results below were measured on a dual-socket server with Intel® Xeon® Platinum 8452Y CPUs (36 cores × 2, 1 TB DDR5), equipped with an NVIDIA A100 (40 GB) for full-precision models and an NVIDIA RTX 4080 (16 GB) for quantized models. We evaluate DeepSeek-V3-0324 (DS-3), DeepSeek-V2.5-1210 (DS-2), and Qwen2-57B-A14B (QW-2), comparing KTransformers against Llama.cpp and Fiddler across both the prefill and decode phases.
In the prefill phase, KTransformers consistently outperforms both baselines across all prompt lengths. While Llama.cpp shows advantages in short-prompt scenarios through aggressive operator fusion, and Fiddler benefits from AMX acceleration for long prompts, KTransformers surpasses both by leveraging AMX-optimized CPU kernels and improved CPU/GPU coordination. For example, our CPU MoE kernel achieves 21.3 TFLOPS on DS-3, a 3.98× improvement over the PyTorch baseline.
In the decode phase, KTransformers (without Expert Deferral) achieves 2.42×–4.09× speedups over Fiddler and 1.25×–1.76× over Llama.cpp on full-precision models. With quantized models, the gains are even larger (1.77×–1.93× vs. Llama.cpp), primarily due to reduced kernel execution time and our efficient CUDA Graph-based scheduling, which reduces GPU launch overhead from over 20% to nearly zero.
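For readers unfamiliar with the CUDA Graph point, the following is a minimal hedged PyTorch sketch, not the actual KTransformers scheduler, of the general mechanism: a fixed-shape decode step is captured once and then replayed, so each decoding iteration costs a single graph launch instead of many individual kernel launches. `decode_step` and the buffer shapes here are placeholders.

```python
import torch

def decode_step(x, weight):
    # Placeholder for the GPU work of one fixed-shape decode iteration.
    return torch.relu(x @ weight)

device = "cuda"
x = torch.randn(1, 4096, device=device)          # static input buffer
weight = torch.randn(4096, 4096, device=device)

# Warm up on a side stream so capture sees already-initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = decode_step(x, weight)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph (outputs land in static buffers).
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    y = decode_step(x, weight)

# Replay: refill the static input buffer, then run the captured step as a
# single graph launch per token instead of one launch per kernel.
for _ in range(10):
    x.copy_(torch.randn(1, 4096, device=device))
    graph.replay()
    # `y` now holds this step's result.
```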
Roadmap
- Hybrid inference with compressed tensor format + AMX kernel integration + CUDA Graph support. https://github.com/sgl-project/sglang/pull/11487
- Support hybrid quant config.
- Support more weight formats (e.g., GPTQ, AWQ).
- Refactor to use experts_map instead of num_gpu_experts (see the sketch after this list).
- Avoid padding when using DP Attention.
- Hotness-aware expert distribution.
- Expert Deferral. https://github.com/sgl-project/sglang/pull/12586
- Add unit tests.
- Add tutorial and deployment guide.
- Support speculative decoding.
- Support more models (e.g., Qwen3, GLM4.5, Kimi-K2).
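Regarding the `experts_map` item above, here is a tiny hypothetical illustration of the intended flexibility (only the names `num_gpu_experts` and `experts_map` come from the roadmap; the dict format is an assumption): a single count can only put a contiguous prefix of experts on the GPU, while an explicit map can encode arbitrary, for example hotness-aware, placements.

```python
NUM_EXPERTS = 8

# num_gpu_experts style: experts [0, num_gpu_experts) go to the GPU.
num_gpu_experts = 3
placement_from_count = {
    eid: ("cuda" if eid < num_gpu_experts else "cpu") for eid in range(NUM_EXPERTS)
}

# experts_map style (hypothetical format): any expert can live on any device,
# so a hot expert with a high id can still be pinned to the GPU.
experts_map = {0: "cuda", 1: "cpu", 2: "cpu", 3: "cuda",
               4: "cpu", 5: "cuda", 6: "cpu", 7: "cpu"}

print(placement_from_count)
print(experts_map)
```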
Related resources
Repo: https://github.com/kvcache-ai/ktransformers
SOSP '25 Paper: https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/
CC: @Atream @ovowei @chenht2022 @Azure-Tang @ErvinXie
Why is this integration AMX exclusive? The original KTransformers library was not.
Other platform support will also be upstreamed.
Hello, and thank you for this! Finally an inference backend that can fully utilize my dual Xeon + single GPU setup.
I was able to download the Deepseek-v3-0324 quants from modelscope and deploy them locally, but I was wondering if you would be able to provide additional guidance on how to correctly convert model weights to use with this backend?
@intervitens managed to create quants identical to the ones provided on modelscope for GLM-4.5-Air by using llm-compressor with the included yaml recipe for the GPU portion, and the provided script for the CPU portion (which does have a typo where "subpool" should be "threadpool").
However, we've been hitting some roadblocks with Deepseek. I modified the script to save safetensors files per layer rather than accumulating them until the end (otherwise the Deepseek conversion requires more than 768 GB of RAM), and managed to convert the original fp8 Deepseek-V3-0324 to an INT4 quant that loads and infers without error, but it produces no valid output - just an empty string - possibly an overflow during inference? The resulting layer tensors are also not hash-identical to the ones for GLM-4.5-Air.
Update: It seems that something is broken with the conversion script's fp8 input support. Dequanting Deepseek to bf16 first seems to work.
Update 2: Uploaded some compatible weights to HF for anyone interested to try out.
Update 3: We seem to be having some problems with poor output quality - most noticeably occasional misspelled words during generation. I initially attributed this to the RTN int4 experts, but it still seems to happen with int8 experts. Intervitens also got an AIME24 score of 0.466 using fp8 non-expert params on GPU with int8 experts on CPU, while various Openrouter providers range from 0.5-0.63 for Deepseek V3 0324.
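For anyone hitting the same memory wall during conversion, here is a hedged sketch of the "write a shard per layer instead of accumulating everything" approach described above; the layer iterator and the pass-through "quantization" step are placeholders, not the actual conversion script.

```python
import os

import torch
from safetensors.torch import save_file

def convert_per_layer(layer_tensors_iter, out_dir):
    """layer_tensors_iter yields (layer_idx, {name: tensor}) one layer at a time."""
    os.makedirs(out_dir, exist_ok=True)
    for layer_idx, tensors in layer_tensors_iter:
        # Quantize / repack this layer's tensors here (placeholder: pass-through),
        # then write the shard immediately so the full model never sits in RAM.
        shard_path = os.path.join(out_dir, f"layer_{layer_idx:03d}.safetensors")
        save_file({k: v.contiguous() for k, v in tensors.items()}, shard_path)
        del tensors  # drop references before moving to the next layer

# Example with dummy data; a real run would stream layers from the checkpoint.
dummy = ((i, {"weight": torch.randn(8, 8)}) for i in range(2))
convert_per_layer(dummy, "./converted")
```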
@ovowei I guess we haven't supported RTN int4 or int8 converted from fp8 yet, since the fp8 dequantization still needs to be written. It also seems something is wrong with the int8 support. We will fix it soon. @Atream
For the record, I also encountered the problem with degraded-appearing outputs and frequent misspelled words using the Deepseek V3 0324 INT4 quants provided on modelscope (in conjunction with the fp8+GPTQ4 GPU weights) - which I initially just attributed to quality loss from RTN channel-wise INT4 quantization - until I encountered the same problems using our own converted INT8 weights.
We did push the weights (and mirrored some of the modelscope weights) to HuggingFace if you'd like to inspect them: https://huggingface.co/CPU-Hybrid-MoE
This is the first backend that can really utilize all of the hardware on my dual Xeon + RTX Pro 6000 setup, so I'm definitely looking forward to future updates.
There is a bug where the kt-amx-method choice is read directly from the env, which means the command-line arg is not working. But since your output is not nonsense words, I guess you have set it correctly via the env. We will try to determine whether the int8 kernel has something wrong, and will also provide official weights for you to check soon.
Yes, `export AMX_METHOD=AMXINT8` was necessary to get it to convert or infer INT8 CPU weights at all (otherwise it produces absolute gibberish). The outputs I was getting with either INT4 or INT8 were coherent text for the most part, but with significant degradation in quality, including misspelled words, even with min-p 0.1 and top-p 0.9 truncation at a temperature of 1.
Some examples are like: "raucous peal of lafter" "mischvous ardor" "cruitial moment"
I thought perhaps it was a tokenizer issue as the included tokenizer with the GPU weights did not match the size/hash of the original DeepSeek V3 0324, but replacing the tokenizer and reloading the model did not entirely eliminate this problem.
Which GPU weights are you using (the fp8+GPTQ4 GPU weights from here: https://modelscope.cn/models/ApproachingAI2024/DeepSeek-V3-0324-GPU-weight/files )?
And thanks for your discovery and help. We are going to check this and run some tests to figure it out.
Yes, those, with either the INT4 NUMA2 CPU weights provided on modelscope, or our quantized INT8 NUMA4 CPU weights that are uploaded to HuggingFace.
Also, is glm4moe considered a working architecture on this backend? I see that there are weights for GLM 4.5 and GLM 4.5 Air. Intervitens tried to replicate the quant method on GLM 4.6 using this GPTQ recipe. However, even with bf16 non-expert GPU weights and INT8 CPU weights, he ran into issues with the output degrading into infinite looping in the thinking phase.
My launch command was:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
AMX_METHOD=AMXINT8 \
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 5000 \
--model /mnt/data/models/DeepSeek-V3-0324-GPU-FP8-GPTQ4 \
--kt-amx-weight-path /mnt/data/models/DeepSeek-V3-0324-CPU-NUMA4-AMXINT8 \
--kt-cpuinfer 128 \
--kt-threadpool-count 4 \
--kt-num-gpu-experts 0 \
--kt-amx-method AMXINT8 \
--attention-backend flashinfer \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 1 \
--context-length 131072 \
--max-total-tokens 131072 \
--served-model-name DeepSeek-V3-0324 \
--enable-mixed-chunk \
--disable-shared-experts-fusion
Note:
- `SGLANG_ENABLE_JIT_DEEPGEMM=0` seems to be needed for RTX Pro 6000 Blackwell because otherwise `sglang` tries to use DeepGEMM which only supports sm_90 and sm_100
- I had to switch to the flashinfer attention backend because the triton attention backend's kernels require too much shared memory for sm_120 and fail to compile (possibly also some assumptions about sm_120 behaving like sm_90 or sm_100 when it's actually more similar to sm_89)
- It's a dual socket Xeon server with 2 SNCs per socket, thus the configuration being 4 NUMA nodes by default
Great. And yes, we haven't supported GLM-4.5 in this release; we will add more models after our features and code are stable.
Which ktransformers commit was used to run all of the above experiments?
If you mean my testing where I ran into the slight incoherence issues with Deepseek V3 0324, I was using kt-kernel from https://github.com/kvcache-ai/ktransformers/commit/d7ab3b41b7c80eb55ac012099aba8e35991f0b3c.
Thank you for the information. I am wondering how much CPU memory is needed to run deepseek v3 0324. Is 256 GB enough?
Definitely not. For the full fp8+int8 model you'd need around 768 GB. Also, you probably don't want to replicate the setup I tested on, because the text outputs were subtly (but noticeably) degraded. I haven't tested whether any of the newer commits to sglang or kt-kernel fix this.
Update: Pulled everything to test https://github.com/sgl-project/sglang/commit/108647311125e44f93ae087253b0c47bff42e994 and https://github.com/kvcache-ai/ktransformers/commit/94c25626dca35a5fb39fe6d83f3c51a40efe2027, and sglang is hitting an AttributeError while starting to load the GPU weights, so it will need more troubleshooting lol.
To help us reproduce and look into the AttributeError when loading GPU weights, could you share more info (full error log, model/weight format, launch command, etc)?
Launch command:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
AMX_METHOD=AMXINT8 \
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 5000 \
--model /mnt/data/models/DeepSeek-V3-0324-GPU-FP8-GPTQ4 \
--kt-weight-path /mnt/data/models/DeepSeek-V3-0324-CPU-NUMA4-AMXINT8 \
--kt-cpuinfer 128 \
--kt-threadpool-count 4 \
--kt-num-gpu-experts 0 \
--kt-method AMXINT8 \
--attention-backend flashinfer \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 1 \
--context-length 131072 \
--max-total-tokens 131072 \
--served-model-name DeepSeek-V3-0324 \
--enable-mixed-chunk \
--disable-shared-experts-fusion
Error traceback:
[2025-11-10 19:29:07] Scheduler hit an exception: Traceback (most recent call last):
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2679, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 237, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 323, in __init__
self.initialize(min_per_gpu_memory)
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 409, in initialize
self.load_model()
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 763, in load_model
self.model = get_model(
^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 594, in load_model
model = _initialize_model(
^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
return model_class(**kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 3034, in __init__
self.model = DeepseekV2Model(
^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2830, in __init__
self.layers, self.start_layer, self.end_layer = make_layers(
^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 576, in make_layers
+ get_offloader().wrap_modules(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/utils/offloader.py", line 36, in wrap_modules
return list(all_modules_generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 578, in <genexpr>
layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2832, in <lambda>
lambda idx, prefix: DeepseekV2DecoderLayer(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2588, in __init__
self.self_attn = DeepseekV2AttentionMLA(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 1306, in __init__
and self.fused_qkv_a_proj_with_mqa.weight.dtype == torch.bfloat16
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/docshotgun/sglang/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
AttributeError: 'ReplicatedLinear' object has no attribute 'weight'
GPU weights (mirror of the modelscope weights): https://huggingface.co/CPU-Hybrid-MoE/DeepSeek-V3-0324-GPU-FP8-GPTQ4
CPU weights: https://huggingface.co/CPU-Hybrid-MoE/DeepSeek-V3-0324-CPU-NUMA4-AMXINT8
This was previously able to launch and produce (subtly degraded) text outputs with https://github.com/kvcache-ai/ktransformers/commit/d7ab3b41b7c80eb55ac012099aba8e35991f0b3c and an older commit of sglang; I don't think I have that documented, but it would have been around the time the lmsys blog post about this backend was published. I was wanting to test whether more recent commits fixed the issue of subtly degraded outputs.
@DocShotgun For the DeepSeek series models, we previously hard-coded the behavior in our codebase: the non-expert layers were always treated as FP8 and the expert layers as GPTQ4, instead of detecting this automatically from the config. That worked for older versions, but in recent updates we removed the hard-coded path and switched to fully relying on the model config to determine whether FP8 / GPTQ4 are used.
So for newer versions, if the FP8 settings aren't explicitly present in config.json, loading can fail or behave incorrectly. To fix this, please add the following field under quantization_config in your config.json:
"linear_fp8_config": {
"activation_scheme": "dynamic",
"fmt": "e4m3",
"quant_method": "fp8",
"weight_block_size": [
128,
128
]
}
We'll update the HuggingFace model files and documentation soon.
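For convenience, here is a small hedged helper (not an official tool) that applies the fix above by inserting the `linear_fp8_config` block under `quantization_config` in an existing `config.json`, leaving the other keys such as `config_groups` untouched; the path is just the one from the launch command earlier in this thread.

```python
import json

# Illustrative path taken from the launch command above.
config_path = "/mnt/data/models/DeepSeek-V3-0324-GPU-FP8-GPTQ4/config.json"

with open(config_path) as f:
    config = json.load(f)

# Add the FP8 settings for the non-expert linear layers under quantization_config.
config.setdefault("quantization_config", {})["linear_fp8_config"] = {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```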
Roger that, updated my config.json and added that block, keeping the config_groups key with the GPTQ params. Loads and infers again at around 13 T/s on my setup (dual Xeon platinum + RTX pro 6000) with FP8 non-experts on GPU and INT8 experts on CPU. Will need additional testing to see if I can replicate the slight incoherence/degraded output quality I observed previously.
Also, has the CPU weight conversion script been fixed for FP8 inputs?
Note:
* `SGLANG_ENABLE_JIT_DEEPGEMM=0` seems to be needed for RTX Pro 6000 Blackwell because otherwise `sglang` tries to use DeepGEMM which only supports sm_90 and sm_100
* I had to switch to the flashinfer attention backend because the triton attention backend's kernels require too much shared memory for sm_120 and fail to compile (possibly also some assumptions about sm_120 behaving like sm_90 or sm_100 when it's actually more similar to sm_89)
Side-notes:
- I don't seem to require `SGLANG_ENABLE_JIT_DEEPGEMM=0` on main.
- This can be fixed by following the steps outlined here https://github.com/sgl-project/sglang/tree/v0.5.5/benchmark/kernels/fused_moe_triton or the writeup https://www.jamesflare.com/tuning-fused-moe-triton/. I've done it for tp=2 for GLM4.5-Air in https://github.com/sgl-project/sglang/pull/13711; it's been done for tp=4 in https://github.com/sgl-project/sglang/pull/9251, but for the MaxQ edition.

Things I'm unsure of:
- #9251 has an extra expert (127), not sure why, probably the MTP layer? So it might need to be added manually.
- I don't know if you can bench GLM4.6 or GLM4.5 Air on a single GPU since it does not have enough RAM to load it. When I launched the script it doesn't look like it uses more than 7~8 GB of VRAM, so it might work (it doesn't load the model, just the config).
Another thing:
I had to switch to the flashinfer attention backend because the triton attention backend's kernels require too much shared memory for sm_120 and fail to compile (possibly also some assumptions about sm_120 behaving like sm_90 or sm_100 when it's actually more similar to sm_89)
From what I understand, you have 2 backends involved:
- the attention backend (triton, flashinfer, trtllm, flashattention, can be different for context/prompt/prefill and decode/generation) that lives in https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
- the moe backend (triton, cutlass, flashinfer, KTransformers) that lives in https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/moe
So you can mix flashinfer attention with triton fused-moe kernels.
Maybe the deepgemm issue is solved now, as it was around a month ago when I tested it. But regardless, the solution is to just not use deepgemm, since it only supports sm_90 and sm_100 - whether it's automatically disabled by sglang or I have to disable it manually via the env var.
I'm not sure if it's worth going through the hassle of tuning the triton attention backend for my GPU setup since flashinfer works ootb, unless there is a massive performance boost.
On GLM-Air-FP8 with TP=2: 81 tok/s to 100 tok/s for single-query inference when using flashinfer + Triton MoE, and for batching 64 queries, 450 tok/s for flashinfer attention+MoE, 660 tok/s for Triton attention+FusedMoE, and 1100 tok/s for flashinfer attention + Triton FusedMoE; see screenshots here: https://github.com/vllm-project/vllm/issues/26838#issuecomment-3562367594
That said, MoE is the easy part to run on CPU, so I expect KTransformers will put that part on the CPU, in which case the Triton fused-MoE kernels can't be used.