Ma Mingfei
@ubergarm I happen to know the people doing the ktransformers project. Its idea of utilizing the Xeon (large memory) to host the MoE experts and the GPU for the other layers is fascinating...
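For context, here is a minimal sketch of that placement idea, not the actual ktransformers code: the router (gate) and the rest of the model stay on the GPU, while the memory-heavy expert FFNs live in host DRAM and execute on the CPU. The class name, shapes, and sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

class CpuOffloadedMoE(nn.Module):
    """Toy MoE block illustrating the split: the small gate runs on the GPU,
    the large expert FFNs are kept on the CPU (host memory)."""

    def __init__(self, hidden=64, inter=128, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts)    # small, lives on the GPU
        self.experts = nn.ModuleList(                  # large, stays on the CPU
            nn.Sequential(nn.Linear(hidden, inter), nn.SiLU(),
                          nn.Linear(inter, hidden))
            for _ in range(num_experts))

    def forward(self, x):                              # x: [tokens, hidden] on the GPU
        weights, idx = torch.softmax(self.gate(x), -1).topk(self.top_k, -1)
        x_cpu = x.to("cpu")                            # ship activations to host DRAM
        out = torch.zeros_like(x_cpu)
        for e, expert in enumerate(self.experts):
            sel = idx == e                             # [tokens, top_k] routing mask
            tok = sel.any(-1).to("cpu")                # tokens routed to expert e
            if not tok.any():
                continue
            w = (weights * sel).sum(-1).to("cpu")[tok]        # per-token gate weight
            out[tok] += w.unsqueeze(-1) * expert(x_cpu[tok])  # expert compute on CPU
        return out.to(x.device)                        # bring results back to the GPU

dev = "cuda" if torch.cuda.is_available() else "cpu"
moe = CpuOffloadedMoE()
moe.gate.to(dev)
print(moe(torch.randn(8, 64, device=dev)).shape)       # torch.Size([8, 64])
```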
@chunyuan-w LGTM! Let's wait until @blzheng finishes the CMakeLists.txt change and rebase after it.
@chunyuan-w need to fix the CI failures if they are real.
@chunyuan-w please rebase now that https://github.com/sgl-project/sglang/pull/6115 has landed.
@blossomin Ascend also does not support fp8; they re-quantize the model to int8. On the CPU path we also support int8 with a w8a8 per-channel recipe; it is the same...
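A minimal sketch of what such a w8a8 per-channel recipe looks like, assuming symmetric int8 quantization with one scale per output channel for the weight and a per-tensor scale for the activation; the function names are illustrative and not the sglang or Ascend implementation.

```python
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix
    w with shape [out_features, in_features]: one scale per row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, qw: torch.Tensor, w_scale: torch.Tensor):
    """w8a8: quantize the activation per tensor to int8, do the int8 matmul
    (accumulated in int32 here), then dequantize with both scales."""
    a_scale = x.abs().amax() / 127.0
    qx = torch.clamp((x / a_scale).round(), -128, 127).to(torch.int8)
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)   # int32 accumulation
    return acc.to(torch.float32) * a_scale * w_scale.t()

w = torch.randn(16, 32)
x = torch.randn(4, 32)
qw, s = quantize_per_channel_int8(w)
print((w8a8_linear(x, qw, s) - x @ w.t()).abs().max())  # small quantization error
```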
@yanbing-j we also need kernel-level test cases for `decode_attention` and `extend_attention`; they will help us debug future optimizations.
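As a rough sketch of what such a kernel-level test could look like: a naive PyTorch reference for decode attention plus a comparison harness. Here `kernel_fn` is a placeholder for whatever entry point the PR exposes; the real kernel's signature (KV-cache layout, paging, batching) will differ.

```python
import torch

def ref_decode_attention(q, k_cache, v_cache):
    """Naive reference for single-token (decode) attention: one query per
    sequence attends over its cached keys/values.
    q: [batch, heads, head_dim]; k_cache, v_cache: [batch, seq, heads, head_dim]."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhd,bshd->bhs", q, k_cache) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhs,bshd->bhd", probs, v_cache)

def test_decode_attention_matches_reference(kernel_fn):
    """kernel_fn: the CPU kernel under test (placeholder signature)."""
    torch.manual_seed(0)
    b, s, h, d = 2, 64, 8, 128
    q = torch.randn(b, h, d)
    k = torch.randn(b, s, h, d)
    v = torch.randn(b, s, h, d)
    torch.testing.assert_close(kernel_fn(q, k, v),
                               ref_decode_attention(q, k, v),
                               rtol=1e-3, atol=1e-3)

# smoke-check the harness by testing the reference against itself
test_decode_attention_matches_reference(ref_decode_attention)
```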
You can cherry-pick commits from our development branch [cpu_opt_ww11](https://github.com/mingfeima/sglang/tree/cpu_opt_ww11) if necessary, as this will keep the original commit messages.
Use cherry-pick; don't directly replace files from our working branch.
@yanbing-j this PR covers too much extra scope: it has tensor-parallel-related stuff and also the MoE layer changes. I expect this PR to cover only: * intel amx...