Ma Mingfei
@ubergarm I happen to know the people doing the ktransformers project. Its idea of utilizing the Xeon (large memory) to host the MoE experts and the GPU for the other layers is fascinating...
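For context, here is a minimal sketch of that placement idea, not the actual ktransformers code: the router (gate) and the rest of the model stay on the GPU, while the memory-heavy expert FFNs live in host DRAM and execute on the CPU. The class name, shapes, and sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

class CpuOffloadedMoE(nn.Module):
    """Toy MoE block illustrating the split: the small gate runs on the GPU,
    the large expert FFNs are kept on the CPU (host memory)."""

    def __init__(self, hidden=64, inter=128, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts)    # small, lives on the GPU
        self.experts = nn.ModuleList(                  # large, stays on the CPU
            nn.Sequential(nn.Linear(hidden, inter), nn.SiLU(),
                          nn.Linear(inter, hidden))
            for _ in range(num_experts))

    def forward(self, x):                              # x: [tokens, hidden] on the GPU
        weights, idx = torch.softmax(self.gate(x), -1).topk(self.top_k, -1)
        x_cpu = x.to("cpu")                            # ship activations to host DRAM
        out = torch.zeros_like(x_cpu)
        for e, expert in enumerate(self.experts):
            sel = idx == e                             # [tokens, top_k] routing mask
            tok = sel.any(-1).to("cpu")                # tokens routed to expert e
            if not tok.any():
                continue
            w = (weights * sel).sum(-1).to("cpu")[tok]        # per-token gate weight
            out[tok] += w.unsqueeze(-1) * expert(x_cpu[tok])  # expert compute on CPU
        return out.to(x.device)                        # bring results back to the GPU

dev = "cuda" if torch.cuda.is_available() else "cpu"
moe = CpuOffloadedMoE()
moe.gate.to(dev)
print(moe(torch.randn(8, 64, device=dev)).shape)       # torch.Size([8, 64])
```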
@chunyuan-w LGTM! Let's wait until @blzheng finishes the CMakeLists.txt change and rebase after it.
@chunyuan-w need to fix the CI failures if they are real.
@chunyuan-w please rebase now that https://github.com/sgl-project/sglang/pull/6115 has landed.
@blossomin Ascend also does not support fp8; they re-quantize the model to int8. On the CPU path we also support int8 with a w8a8 per-channel recipe; it is the same...
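A minimal sketch of what such a w8a8 per-channel recipe looks like, assuming symmetric int8 quantization with one scale per output channel for the weight and a per-tensor scale for the activation; the function names are illustrative and not the sglang or Ascend implementation.

```python
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix
    w with shape [out_features, in_features]: one scale per row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, qw: torch.Tensor, w_scale: torch.Tensor):
    """w8a8: quantize the activation per tensor to int8, do the int8 matmul
    (accumulated in int32 here), then dequantize with both scales."""
    a_scale = x.abs().amax() / 127.0
    qx = torch.clamp((x / a_scale).round(), -128, 127).to(torch.int8)
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)   # int32 accumulation
    return acc.to(torch.float32) * a_scale * w_scale.t()

w = torch.randn(16, 32)
x = torch.randn(4, 32)
qw, s = quantize_per_channel_int8(w)
print((w8a8_linear(x, qw, s) - x @ w.t()).abs().max())  # small quantization error
```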
@yanbing-j we also need kernel-level test cases for `decode_attention` and `extend_attention`; they will help us debug future optimizations.
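As a rough sketch of what such a kernel-level test could look like: a naive PyTorch reference for decode attention plus a comparison harness. Here `kernel_fn` is a placeholder for whatever entry point the PR exposes; the real kernel's signature (KV-cache layout, paging, batching) will differ.

```python
import torch

def ref_decode_attention(q, k_cache, v_cache):
    """Naive reference for single-token (decode) attention: one query per
    sequence attends over its cached keys/values.
    q: [batch, heads, head_dim]; k_cache, v_cache: [batch, seq, heads, head_dim]."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhd,bshd->bhs", q, k_cache) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhs,bshd->bhd", probs, v_cache)

def test_decode_attention_matches_reference(kernel_fn):
    """kernel_fn: the CPU kernel under test (placeholder signature)."""
    torch.manual_seed(0)
    b, s, h, d = 2, 64, 8, 128
    q = torch.randn(b, h, d)
    k = torch.randn(b, s, h, d)
    v = torch.randn(b, s, h, d)
    torch.testing.assert_close(kernel_fn(q, k, v),
                               ref_decode_attention(q, k, v),
                               rtol=1e-3, atol=1e-3)

# smoke-check the harness by testing the reference against itself
test_decode_attention_matches_reference(ref_decode_attention)
```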
You can cherry-pick commits from our development branch [cpu_opt_ww11](https://github.com/mingfeima/sglang/tree/cpu_opt_ww11) if necessary, as this will keep the original commit messages.
Use cherry-pick; don't directly replace files from our working branch.
@yanbing-j this PR covers too much extra scope: it has tensor-parallel-related stuff and also the MoE layer changes. I expect this PR to cover only: * intel amx...