defer expert
I couldn't find any code related to n_deferred_experts in any of the files.
Not supported yet.

May I ask when you will support defer expert?
This part of the code has not yet been merged into the main branch; the merge is in progress. In the meantime, you can refer to the sosp25-ae branch.

Since the YAML mentioned in the paper contains code related to n_deferred_experts, does version 0.4.1 have code related to n_deferred_experts?
KTransformers has been refactored, and the YAML-based flexible injection framework is now deprecated. The inference code now resides in kt-kernel, and launching it with SGLang is recommended. When launching the SGLang server, you can specify --kt-max-deferred-experts-per-token to control the number of deferred experts.
Related PR you may need: #1545
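For anyone landing here later, a minimal sketch of what such a launch could look like. Only --kt-max-deferred-experts-per-token comes from the answer above; the model path, port, and the value 2 are illustrative placeholders, not a recommended configuration:

```bash
# Launch an SGLang server with a cap on deferred experts per token.
# /path/to/model, the port, and the value 2 are placeholders -- adjust for your setup.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --port 30000 \
  --kt-max-deferred-experts-per-token 2
```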
May I ask: with kt 0.4.2 + SGLang, is it possible to run the Qwen2-57B-A14B model on an A100 with 40 GB of VRAM? When I ran the Qwen3-30B-A3B model, it already used close to 40 GB.