defer expert
I couldn't find any code related to n_deferred_experts in any of the files.
Not supported yet.

May I ask when you will support defer expert?
This part of the code has not yet been merged into the main branch; the merge is in progress. In the meantime, you can refer to the sosp25-ae branch.

Since the YAML mentioned in the paper contains code related to n_deferred_experts, does version 0.4.1 have code related to n_deferred_experts?
KTransformers has been refactored, and the YAML-based flexible injection framework is now deprecated. The inference code now resides in kt-kernel, and launching it with SGLang is recommended. When launching the SGLang server, you can specify --kt-max-deferred-experts-per-token to control the number of deferred experts.
Related PR you may need: #1545
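For anyone landing here later, a minimal sketch of what such a launch could look like. Only --kt-max-deferred-experts-per-token comes from the answer above; the model path, port, and the value 2 are illustrative placeholders, not a recommended configuration:

```bash
# Launch an SGLang server with a cap on deferred experts per token.
# /path/to/model, the port, and the value 2 are placeholders -- adjust for your setup.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --port 30000 \
  --kt-max-deferred-experts-per-token 2
```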
May I ask: with kt 0.4.2 + SGLang, is it possible to run the Qwen2-57B-A14B model on an A100 with 40 GB of VRAM? When I ran the Qwen3-30B-A3B model, it already used close to 40 GB.