sglang
Support pass manager framework and fusion pass
Motivation
Passes are the key components for code transformation, optimization, and analysis in compilers such as LLVM and TVM.
In LLVM, for example, the Pass Manager is a core component of the compiler infrastructure. Its main goals are to:
- orchestrate the execution of a sequence of passes over a specific unit of intermediate representation (IR), such as a module or a function.
- pipeline the execution of passes for better performance, manage analysis results and their invalidation, and enforce a disciplined workflow for pass developers.
In SGLang, we introduce a similar Pass Manager framework to orchestrate the execution of passes. Furthermore, we add a fusion pass that fuses several complex operators; more passes, such as AsyncTPPass and SequenceParallelismPass, will be introduced in follow-up PRs.
Many code pieces are borrowed from vLLM, with significant SGLang-specific customization. We express our respect to the vLLM developers who worked on this area.
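To make the orchestration idea concrete, here is a minimal, hypothetical sketch of a pass-manager framework. The class names (`Pass`, `PassManager`, `ConstantFoldPass`) and the tuple-based toy IR are illustrative assumptions and do not reflect SGLang's actual implementation:

```python
# Minimal pass-manager sketch. All names here are hypothetical;
# SGLang's real framework operates on torch.fx graphs, not tuples.
from abc import ABC, abstractmethod
from typing import Any, List


class Pass(ABC):
    """A single transformation or analysis over one unit of IR."""

    @abstractmethod
    def run(self, ir: Any) -> Any:
        ...


class PassManager:
    """Runs a configured sequence of passes over an IR unit, in order."""

    def __init__(self) -> None:
        self._passes: List[Pass] = []

    def add_pass(self, p: Pass) -> None:
        self._passes.append(p)

    def run(self, ir: Any) -> Any:
        # Each pass receives the output of the previous one.
        for p in self._passes:
            ir = p.run(ir)
        return ir


class ConstantFoldPass(Pass):
    """Toy transformation: fold ("add", 2, 3) into the constant 5."""

    def run(self, ir: Any) -> Any:
        if isinstance(ir, tuple) and ir and ir[0] == "add":
            return ir[1] + ir[2]
        return ir
```

A pass manager built this way also gives a natural place to hang cross-cutting concerns such as per-pass timing, logging, and cache invalidation, which is the role the LLVM Pass Manager plays.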
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
- [ ] Format your code according to the Format code with pre-commit.
- [ ] Add unit tests according to the Run and add unit tests.
- [ ] Update documentation according to Write documentations.
- [ ] Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Summary of Changes
Hello @yuan-luo, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the SGLang compilation backend by introducing a new, unified configuration system (SGLangConfig) that centralizes model, device, and compilation settings. It also lays the groundwork for advanced graph optimizations by adding support for collective fusion passes, specifically targeting allreduce and RMSNorm operations using FlashInfer. The changes aim to improve the modularity, maintainability, and performance of the SGLang compilation pipeline, although some fusion passes are currently marked as work-in-progress.
Highlights
- New SGLang Configuration System: Introduced a centralized `SGLangConfig` dataclass to manage model, device, and compilation settings, replacing the old `CompilationConfig` for a more unified approach.
- Refactored Compilation Backend: The `SGLangBackend` and `PiecewiseCompileInterpreter` classes were updated to integrate seamlessly with the new `SGLangConfig` and `CompilationConfig`, streamlining how compilation settings are passed and utilized throughout the system.
- Introduced Collective Fusion Pass: Added `AllReduceFusionPass` to optimize `allreduce + RMSNorm` operations, leveraging FlashInfer for potential performance gains. This includes new pattern matching utilities to identify and replace these operations.
- Enhanced Compilation Cache Logic: Implemented cache loading for previously compiled graphs within the `SGLangBackend`, along with detailed timing mechanisms to record compilation and cache hit durations, aiming to reduce redundant compilation efforts.
- New Inductor Pass Framework: Created `SGLangInductorPass` and `SGLangPatternMatcherPass` to provide a structured and extensible way for defining and managing custom Inductor passes, including built-in logging and debugging utilities.
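Conceptually, the fusion pass described above rewrites a matched `allreduce` followed by `RMSNorm` into a single fused call. The following sketch shows the idea on a toy list-of-tuples IR; the function name, op names, and IR shape are assumptions for illustration only (the real pass matches patterns on torch.fx/Inductor graphs):

```python
# Hypothetical illustration of pattern-based fusion: scan an op sequence
# for `allreduce` immediately followed by `rms_norm` and replace the pair
# with one fused op (standing in for a FlashInfer fused kernel call).
from typing import List, Tuple

Op = Tuple[str, ...]  # (op_name, *args) -- toy IR, not SGLang's actual IR


def fuse_allreduce_rmsnorm(ops: List[Op]) -> List[Op]:
    fused: List[Op] = []
    i = 0
    while i < len(ops):
        if (
            i + 1 < len(ops)
            and ops[i][0] == "allreduce"
            and ops[i + 1][0] == "rms_norm"
        ):
            # Matched the pattern: emit a single fused op carrying
            # the arguments of both originals.
            fused.append(
                ("fused_allreduce_rmsnorm",) + ops[i][1:] + ops[i + 1][1:]
            )
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

Fusing the two ops avoids materializing the intermediate allreduce output before normalization, which is where the potential performance gain comes from.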
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with :thumbsup: and :thumbsdown: on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Basic functionality tests passed; more verification is in progress.
Regression passed. Ready to review.
...
Capturing batches (bs=1 avail_mem=22.80 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:30<00:00, 1.71it/s]
[2025-11-09 00:47:24 TP0] Registering 5044 cuda graph addresses
[2025-11-09 00:47:25 TP3] Capture cuda graph end. Time elapsed: 31.21 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP1] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:25 TP0] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP2] Capture cuda graph end. Time elapsed: 31.34 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:26 TP0] max_total_num_tokens=5958788, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=22.78 GB
[2025-11-09 00:47:26] INFO: Started server process [546876]
[2025-11-09 00:47:26] INFO: Waiting for application startup.
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] INFO: Application startup complete.
[2025-11-09 00:47:26] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-09 00:47:27] INFO: 127.0.0.1:59104 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-09 00:47:27 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 00:48:56] INFO: 127.0.0.1:59106 - "POST /generate HTTP/1.1" 200 OK
[2025-11-09 00:48:56] The server is fired up and ready to roll!
[2025-11-09 00:49:27 TP0] Prefill batch, #new-seq: 1, #new-token: 33, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 66, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.33, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 106, token usage: 0.00, cuda graph: True, gen throughput (token/s): 239.64, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 146, token usage: 0.00, cuda graph: True, gen throughput (token/s): 238.96, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 186, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.71, #queue-req: 0,
[2025-11-09 00:49:28 TP0] Decode batch, #running-req: 1, #token: 226, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.62, #queue-req: 0,
[2025-11-09 00:49:28] INFO: 127.0.0.1:49302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
➜ /sgl-workspace python test_openai.py
ChatCompletion(id='b5d705da849a42f7b6e9716669d7e2e0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="<think>\nOkay, the user asked for three countries and their capitals, and then how I rank them. Let me start by picking three countries. Maybe the US, Japan, and Brazil. Their capitals are Washington, D.C., Tokyo, and Brasília. Now, how to rank them? The user didn't specify the criteria, so I need to think of possible ways. Maybe by population, economic size, or cultural influence. Let me check the population. The US has around 330 million, Japan about 125 million, Brazil 215 million. So US first, Brazil second, Japan third. But if I consider GDP, the US is the largest, then Japan, then Brazil. Alternatively, cultural influence: Japan has a strong cultural impact, maybe higher than Brazil. But the user might not have a specific criteria. I should mention that the ranking depends on the criteria and provide examples. Also, make sure the capitals are correct. Washington, D.C", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1762678168, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=33, total_tokens=233, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
@yuan-luo
Hi, thanks for your contribution! Could you please add some benchmark result/script so that we can make some verification? Thanks
@Oasis-Git ok, I'll add it.