
Support pass manager framework and fusion pass

Open yuan-luo opened this issue 1 month ago • 5 comments

Motivation

A pass is a key component in code transformation, optimization, and analysis used by compilers such as LLVM and TVM.

For example, in LLVM a Pass Manager is a core component of the compiler infrastructure. Its main goals are to:

  • orchestrate the execution of a sequence of passes over a specific unit of intermediate representation (IR), such as a module or a function.
  • pipeline the execution of passes for better performance, manage analysis results and their invalidation, and enforce a disciplined workflow for pass developers.
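To make the idea concrete, here is a minimal sketch of a pass manager in Python. The class and pass names (`PassManager`, `dead_op_elimination`) are illustrative only and do not reflect SGLang's actual API; a toy list of op names stands in for the IR.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, List

# Toy IR: a "graph" is just an ordered list of op names.
Graph = List[str]

@dataclass
class PassManager:
    """Runs registered passes in order over one unit of IR."""
    passes: List[Callable[[Graph], Graph]] = field(default_factory=list)

    def add(self, p: Callable[[Graph], Graph]) -> "PassManager":
        self.passes.append(p)
        return self

    def run(self, graph: Graph) -> Graph:
        # Each pass consumes the previous pass's output, forming a pipeline.
        for p in self.passes:
            graph = p(graph)
        return graph

def dead_op_elimination(graph: Graph) -> Graph:
    # Example pass: drop ops that do nothing.
    return [op for op in graph if op != "nop"]

pm = PassManager().add(dead_op_elimination)
print(pm.run(["matmul", "nop", "relu"]))  # ['matmul', 'relu']
```

A real pass manager additionally tracks which analysis results each pass invalidates, but the pipelining structure is the same.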

In SGLang, we introduce a similar Pass Manager framework to orchestrate the execution of passes. Furthermore, we adopt a fusion pass to fuse several complex operators; more passes, such as AsyncTPPass and SequenceParallelismPass, will be introduced in follow-up PRs.

Many code pieces are borrowed from vLLM with significant SGLang customization. We express our respect to the relevant vLLM developers for their work in this area.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

yuan-luo avatar Oct 19 '25 13:10 yuan-luo

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the SGLang compilation backend by introducing a new, unified configuration system (SGLangConfig) that centralizes model, device, and compilation settings. It also lays the groundwork for advanced graph optimizations by adding support for collective fusion passes, specifically targeting allreduce and RMSNorm operations using FlashInfer. The changes aim to improve the modularity, maintainability, and performance of the SGLang compilation pipeline, although some fusion passes are currently marked as work-in-progress.

Highlights

  • New SGLang Configuration System: Introduced a centralized SGLangConfig dataclass to manage model, device, and compilation settings, replacing the old CompilationConfig for a more unified approach.
  • Refactored Compilation Backend: The SGLangBackend and PiecewiseCompileInterpreter classes were updated to integrate seamlessly with the new SGLangConfig and CompilationConfig, streamlining how compilation settings are passed and utilized throughout the system.
  • Introduced Collective Fusion Pass: Added AllReduceFusionPass to optimize allreduce + RMSNorm operations, leveraging FlashInfer for potential performance gains. This includes new pattern matching utilities to identify and replace these operations.
  • Enhanced Compilation Cache Logic: Implemented cache loading for previously compiled graphs within the SGLangBackend, along with detailed timing mechanisms to record compilation and cache hit durations, aiming to reduce redundant compilation efforts.
  • New Inductor Pass Framework: Created SGLangInductorPass and SGLangPatternMatcherPass to provide a structured and extensible way for defining and managing custom Inductor passes, including built-in logging and debugging utilities.
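The AllReduceFusionPass described above is, at its core, a pattern-matching rewrite over the traced graph. The toy sketch below shows the shape of such a rewrite on a simplified op-list IR; the op names (`all_reduce`, `rms_norm`, `fused_allreduce_rmsnorm`) are illustrative placeholders, not the actual FX node targets or FlashInfer kernel names used by the PR.

```python
from typing import List

# Toy IR: a graph as an ordered list of op names.
Graph = List[str]

def allreduce_rmsnorm_fusion(graph: Graph) -> Graph:
    """Replace each adjacent (all_reduce, rms_norm) pair with a single
    fused op, mimicking what a pattern-matcher pass does on Inductor's
    FX graph. Op names are illustrative only."""
    out: Graph = []
    i = 0
    while i < len(graph):
        if (i + 1 < len(graph)
                and graph[i] == "all_reduce"
                and graph[i + 1] == "rms_norm"):
            # In the real pass this would be a fused kernel call,
            # e.g. one backed by FlashInfer.
            out.append("fused_allreduce_rmsnorm")
            i += 2
        else:
            out.append(graph[i])
            i += 1
    return out

print(allreduce_rmsnorm_fusion(["matmul", "all_reduce", "rms_norm", "silu"]))
# ['matmul', 'fused_allreduce_rmsnorm', 'silu']
```

The real implementation matches on FX graph nodes (with argument and dtype checks) rather than adjacent names, but the replace-matched-subgraph-with-fused-op structure is the same.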

gemini-code-assist[bot] avatar Oct 19 '25 13:10 gemini-code-assist[bot]

Basic functionality tests passed; more verification is in progress.

yuan-luo avatar Nov 09 '25 02:11 yuan-luo

Regression tests passed. Ready for review.

...
Capturing batches (bs=1 avail_mem=22.80 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:30<00:00,  1.71it/s]
[2025-11-09 00:47:24 TP0] Registering 5044 cuda graph addresses
[2025-11-09 00:47:25 TP3] Capture cuda graph end. Time elapsed: 31.21 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP1] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:25 TP0] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP2] Capture cuda graph end. Time elapsed: 31.34 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:26 TP0] max_total_num_tokens=5958788, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=22.78 GB
[2025-11-09 00:47:26] INFO:     Started server process [546876]
[2025-11-09 00:47:26] INFO:     Waiting for application startup.
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] INFO:     Application startup complete.
[2025-11-09 00:47:26] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-09 00:47:27] INFO:     127.0.0.1:59104 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-09 00:47:27 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,


[2025-11-09 00:48:56] INFO:     127.0.0.1:59106 - "POST /generate HTTP/1.1" 200 OK
[2025-11-09 00:48:56] The server is fired up and ready to roll!
[2025-11-09 00:49:27 TP0] Prefill batch, #new-seq: 1, #new-token: 33, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 66, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.33, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 106, token usage: 0.00, cuda graph: True, gen throughput (token/s): 239.64, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 146, token usage: 0.00, cuda graph: True, gen throughput (token/s): 238.96, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 186, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.71, #queue-req: 0,
[2025-11-09 00:49:28 TP0] Decode batch, #running-req: 1, #token: 226, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.62, #queue-req: 0,
[2025-11-09 00:49:28] INFO:     127.0.0.1:49302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
➜  /sgl-workspace python test_openai.py
ChatCompletion(id='b5d705da849a42f7b6e9716669d7e2e0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="<think>\nOkay, the user asked for three countries and their capitals, and then how I rank them. Let me start by picking three countries. Maybe the US, Japan, and Brazil. Their capitals are Washington, D.C., Tokyo, and Brasília. Now, how to rank them? The user didn't specify the criteria, so I need to think of possible ways. Maybe by population, economic size, or cultural influence. Let me check the population. The US has around 330 million, Japan about 125 million, Brazil 215 million. So US first, Brazil second, Japan third. But if I consider GDP, the US is the largest, then Japan, then Brazil. Alternatively, cultural influence: Japan has a strong cultural impact, maybe higher than Brazil. But the user might not have a specific criteria. I should mention that the ranking depends on the criteria and provide examples. Also, make sure the capitals are correct. Washington, D.C", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1762678168, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=33, total_tokens=233, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

yuan-luo avatar Nov 09 '25 08:11 yuan-luo

@yuan-luo

Hi, thanks for your contribution! Could you please add some benchmark result/script so that we can make some verification? Thanks

Oasis-Git avatar Nov 09 '25 21:11 Oasis-Git

> Hi, thanks for your contribution! Could you please add some benchmark result/script so that we can make some verification? Thanks

@Oasis-Git ok, I'll add it.

yuan-luo avatar Nov 10 '25 09:11 yuan-luo