
[Feature] RFC for adding CPU support for SGLang


Motivation

Hi, SGLang folks! This is Mingfei from the Intel PyTorch team; we help optimize PyTorch performance on CPU, and I am also the PyTorch module maintainer for CPU performance. We would like to contribute to SGLang by enabling CPU support and optimizing its performance.

Targets

Our primary target is to optimize SGLang performance on Intel Xeon Scalable Processors (x86 server CPUs).

  • Optimization will focus on Xeons with Intel® Advanced Matrix Extensions (AMX) support, including Sapphire Rapids (4th gen), Emerald Rapids (5th gen), and Granite Rapids (6th gen).
  • Native implementations or fallbacks will be provided for CPUs with other ISAs to keep them functional.
  • Provide good performance per dollar.

Limitations

  • Kernels are written with AVX-512 and AMX-BF16 intrinsics, which require GCC 11 or above (a runtime capability check is sketched after this list).
  • BFloat16 and Float16 will both be enabled on CPU, but we only focus on BFloat16 performance optimization at the current stage; Float16 optimization will be added later on.
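For readers who want to check whether a given machine can use the optimized kernels, a rough probe along these lines may help. This is only an illustrative sketch: `torch.backends.cpu.get_cpu_capability()` is available in recent PyTorch releases, and the `/proc/cpuinfo` check is Linux-specific; neither is part of SGLang itself.

```python
# Rough ISA probe -- illustrative only, not part of SGLang.
import torch

# Recent PyTorch reports the vector ISA it dispatches to, e.g. "AVX512".
print("PyTorch CPU capability:", torch.backends.cpu.get_cpu_capability())

# AMX support shows up as CPU flags on Linux (Sapphire Rapids and newer).
with open("/proc/cpuinfo") as f:
    flags = f.read()
print("AMX-BF16 available:", "amx_bf16" in flags)
```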

Schedule for 25Q1

We will focus on the DeepSeek series at the moment to align with our internal development requirements and extend the model coverage later on.

Generic enabling/optimizations for sglang

  • [x] CPU device enabling. We intend to enable the CPU device with the torch-native backend first and then gradually replace the performance-critical components with C++ intrinsics kernels (a sketch of the fallback pattern follows this list). https://github.com/sgl-project/sglang/pull/2806
  • [x] fused kernels for rms_norm, silu_and_mul, sampling and so on.
  • [x] radix attention kernels for extend and decoding.
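To illustrate the "torch-native first, C++ kernels later" plan above, the dispatch usually looks something like the sketch below: call the fused C++ op when the extension is present, otherwise fall back to plain PyTorch ops. The `sgl_kernel.rmsnorm` import is hypothetical and only stands in for whatever fused op gets upstreamed; it is not a confirmed SGLang API.

```python
# Illustrative dispatch pattern only -- the fused-op module/name is hypothetical.
import torch

try:
    from sgl_kernel import rmsnorm as _fused_rmsnorm  # hypothetical fused C++ kernel
except ImportError:
    _fused_rmsnorm = None

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Use the fused C++ kernel when available, otherwise a torch-native fallback."""
    if _fused_rmsnorm is not None and x.device.type == "cpu":
        return _fused_rmsnorm(x, weight, eps)
    # torch-native fallback: functional on any ISA, just slower.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight
```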

DeepSeek performance optimizations

(We are currently mapping the work from DeepSeek Multi-head Latent Attention (MLA) Throughput Optimizations.)

  • [x] MLA decoding kernel optimization with head blocking (a toy sketch of the head-blocking idea follows this list).
  • [x] DeepSeekMoE (FusedMoE)
  • [x] fp8 kv cache (experimental)
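As a rough illustration of what "head blocking" means in the decode path, the toy PyTorch sketch below processes attention heads in blocks so each block's query/KV tiles stay resident in cache. It deliberately ignores the MLA-specific latent/compressed KV layout and all real kernel details (AMX tiling, quantized KV, paged caches); the shapes and block size are assumptions.

```python
# Toy sketch of head blocking for a single decode token -- illustrative only.
import torch

def decode_attention_head_blocked(q, k_cache, v_cache, head_block: int = 8):
    # q:       [num_heads, head_dim]            (one new token)
    # k_cache: [seq_len, num_heads, head_dim]
    # v_cache: [seq_len, num_heads, head_dim]
    num_heads, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.empty_like(q)
    for h0 in range(0, num_heads, head_block):
        h1 = min(h0 + head_block, num_heads)
        # Work on a block of heads at a time so its KV tiles stay hot in cache.
        qb = q[h0:h1]                 # [B, D]
        kb = k_cache[:, h0:h1, :]     # [S, B, D]
        vb = v_cache[:, h0:h1, :]     # [S, B, D]
        scores = torch.einsum("bd,sbd->bs", qb, kb) * scale
        probs = torch.softmax(scores, dim=-1)
        out[h0:h1] = torch.einsum("bs,sbd->bd", probs, vb)
    return out
```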

Tensor Parallel

  • [x] Map TP ranks to the multiple sockets (NUMA nodes) of a single CPU node (see the sketch after this list)
  • [ ] EPMoE
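The NUMA mapping above boils down to giving each TP rank its own socket so that its threads and memory stay node-local. The sketch below only shows the CPU-affinity part; the core ranges are hypothetical for a 2-socket machine, and a real deployment would also need NUMA-aware memory allocation and OpenMP thread binding.

```python
# Illustrative only: pin each TP rank to one NUMA node (Linux).
import os

# Hypothetical 2-socket layout: socket id -> logical core ids.
NODE_CPUS = {0: range(0, 48), 1: range(48, 96)}

def bind_rank_to_numa_node(tp_rank: int) -> None:
    node = tp_rank % len(NODE_CPUS)
    # Restrict this process (pid 0 == self) to the cores of one socket.
    os.sched_setaffinity(0, set(NODE_CPUS[node]))
    # Keep the OpenMP thread pool sized to that socket.
    os.environ["OMP_NUM_THREADS"] = str(len(NODE_CPUS[node]))
```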

We hope to help more customers build a better user experience when deploying sglang on CPU devices. Any feedback is welcome, thanks!

mingfeima avatar Jan 09 '25 07:01 mingfeima

Hi @mingfeima, happy to collaborate with your team! Would you like to join our Slack channel? https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ Thanks!

zhyncs avatar Jan 09 '25 08:01 zhyncs

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Mar 11 '25 00:03 github-actions[bot]

@zhyncs we need to reopen this one. We are currently working on an internal branch to make sure everything is ready, and then we will start upstreaming to the sglang main branch. The optimization work on our side is almost finished; I expect we can start upstreaming soon.

mingfeima avatar Mar 11 '25 01:03 mingfeima

OK

zhyncs avatar Mar 11 '25 01:03 zhyncs

Commenting to keep this thread active; the optimization work is pretty much done internally.

mingfeima avatar Mar 24 '25 01:03 mingfeima

Upstreaming the C++ kernels in https://github.com/sgl-project/sglang/pull/5150

mingfeima avatar Apr 08 '25 06:04 mingfeima

Update the build to use CMakeLists.txt: https://github.com/sgl-project/sglang/pull/6115

mingfeima avatar May 13 '25 01:05 mingfeima

fp8 gemm: https://github.com/sgl-project/sglang/pull/6216

mingfeima avatar May 13 '25 01:05 mingfeima

Enable the Intel AMX attention backend: replacing https://github.com/sgl-project/sglang/pull/6143 with https://github.com/sgl-project/sglang/pull/6405 and https://github.com/sgl-project/sglang/pull/6408

mingfeima avatar May 14 '25 02:05 mingfeima

Add fp8 shared MoE kernels: https://github.com/sgl-project/sglang/pull/6339

The shared MoE kernel is an innovation we made on the CPU backend; it brings a solid decoding speedup when concurrency is small. For example, at concurrency 1 each token computes 1 shared expert + 8 routed experts, and if the shared expert is not fused it alone takes almost as long as 4 fused experts, which is very inefficient. A toy sketch of the idea follows.
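For intuition, the plain-PyTorch sketch below shows the functional effect of treating the shared expert as an "always selected" expert that runs together with the routed experts; the real speedup comes from doing this inside one fused C++ kernel, and the expert shapes/callables here are purely illustrative.

```python
# Toy sketch of fusing the shared expert with the routed experts -- illustrative only.
import torch

def moe_with_shared_expert(x, routed_experts, shared_expert, router_logits, top_k=8):
    # x: [hidden]; each expert is a callable mapping hidden -> hidden.
    weights, idx = torch.topk(torch.softmax(router_logits, dim=-1), top_k)
    # The shared expert behaves like an extra expert that is always selected with
    # weight 1, so it can run in the same pass (and, in the C++ kernel, share the
    # same GEMM tiles) as the routed experts instead of a separate low-utilization pass.
    out = shared_expert(x)
    for w, i in zip(weights, idx.tolist()):
        out = out + w * routed_experts[i](x)
    return out
```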

mingfeima avatar May 18 '25 03:05 mingfeima

Add fp8 support for the existing fused MoE kernels: https://github.com/sgl-project/sglang/pull/6404

mingfeima avatar May 20 '25 04:05 mingfeima

Add docker build: https://github.com/sgl-project/sglang/pull/6458

mingfeima avatar May 20 '25 12:05 mingfeima

Created new labels to track this task more easily: at the moment we can track via either the cpu or the intel label.

mingfeima avatar May 21 '25 03:05 mingfeima

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Jul 21 '25 00:07 github-actions[bot]