mini-sglang
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
Fixing a small typo. Thanks for the great read!
Suggestion: Change the GitHub About field to: "A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems." Currently this field is empty.
The previous comment suggested only 1 scheduler process existed, which was misleading. In reality, world_size scheduler processes are spawned (one per TP rank/GPU), but only the primary rank sends an...
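The corrected process model described above can be sketched as follows. This is a hypothetical illustration (the names `scheduler_loop` and `launch_schedulers` are not from mini-sglang): `world_size` scheduler processes are spawned, one per TP rank, and only rank 0 reports back to the parent.

```python
# Hypothetical sketch: one scheduler process per TP rank/GPU,
# with only the primary rank (rank 0) sending results upstream.
import multiprocessing as mp

# Use fork so the child processes inherit the parent's module state.
ctx = mp.get_context("fork")

def scheduler_loop(rank: int, world_size: int, result_queue) -> None:
    # Every rank runs the same scheduling loop (on its own GPU in the
    # real system); here we just produce a placeholder result.
    answer = f"rank {rank}/{world_size} done"
    # Only the primary rank communicates with the frontend.
    if rank == 0:
        result_queue.put(answer)

def launch_schedulers(world_size: int) -> str:
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=scheduler_loop, args=(rank, world_size, queue))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    result = queue.get()  # only rank 0 ever puts a result
    for p in procs:
        p.join()
    return result
```

Spawning all ranks as full processes (rather than one scheduler plus worker threads) keeps each GPU's control loop independent, which matches the behavior the comment clarifies.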
### Description This PR implements a complete request cancellation mechanism to prevent GPU resource waste when clients disconnect. It addresses the issue described in #15. ### Verification before after fix...
**Summary** (only 94 lines of code) Adds an opt-in, per-request profiling path: clients can send `"profile": true` and mini-sglang will start a `torch.profiler` session for that request, then export a...
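The opt-in pattern described in the summary can be sketched as below. This is not the PR's code: the real implementation uses `torch.profiler` and exports a trace file, while this self-contained stand-in uses the stdlib `cProfile`, and the name `handle_request` is hypothetical.

```python
# Hypothetical sketch of an opt-in, per-request profiling path.
# The real PR wraps the request in a torch.profiler session; cProfile
# stands in here so the example runs without torch.
import cProfile
import io
import pstats

def handle_request(prompt: str, profile: bool = False) -> str:
    def generate() -> str:
        # Placeholder for the actual token-generation step.
        return prompt.upper()

    if not profile:
        # Default path: no profiling overhead for normal requests.
        return generate()

    profiler = cProfile.Profile()
    profiler.enable()
    try:
        out = generate()
    finally:
        profiler.disable()

    # In mini-sglang the trace would be exported to a file; here we
    # just render a small stats summary to show the profiler ran.
    stats = io.StringIO()
    pstats.Stats(profiler, stream=stats).sort_stats("cumulative").print_stats(5)
    return out
```

The key design point is that profiling is scoped to a single request via the `"profile": true` flag, so other in-flight requests pay no overhead.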
### Describe the bug When a client disconnects (e.g., Ctrl+C via curl), the backend (Scheduler/GPU) continues to generate tokens until `max_seq_len` is reached. This wastes GPU resources. ### Reproduction Use...
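The fix this bug calls for can be illustrated with a minimal asyncio sketch (all names here are hypothetical, not mini-sglang's API): once the client disconnect is observed, the generation task is cancelled instead of running to `max_seq_len`.

```python
# Minimal sketch: cancel in-flight generation when the client disconnects,
# so the scheduler stops producing tokens nobody will read.
import asyncio

async def generate_tokens(n_tokens: int, out: list) -> None:
    for i in range(n_tokens):
        out.append(i)
        await asyncio.sleep(0)  # yield, so cancellation can take effect

async def serve_request(disconnect_after: int, n_tokens: int) -> list:
    out: list = []
    task = asyncio.create_task(generate_tokens(n_tokens, out))
    # Stand-in for watching the HTTP connection: the client
    # "disconnects" after a few event-loop steps.
    for _ in range(disconnect_after):
        await asyncio.sleep(0)
    task.cancel()  # abort generation instead of running to max_seq_len
    try:
        await task
    except asyncio.CancelledError:
        pass
    return out

tokens = asyncio.run(serve_request(disconnect_after=3, n_tokens=1000))
```

Generation stops after a handful of tokens rather than all 1000, which is exactly the GPU waste the bug report describes.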
Since this is meant as a simple tutorial, could mini-sglang support a CPU-only mode? I want to run it on my MacBook.
Hi, I am very impressed by the project and have learned a lot! Just curious: does this minimal implementation plan to support MoE architectures in the near future? Thank you!
```
python -m minisgl --model "../../Qwen/Qwen3-0.6B"
[2025-12-20|08:57:37] INFO Parsed arguments: ServerArgs(model_path='../../Qwen/Qwen3-0.6B', tp_info=DistributedInfo(rank=0, size=1), dtype=torch.bfloat16, max_running_req=256, attention_backend='auto', cuda_graph_bs=None, cuda_graph_max_bs=None, page_size=1, memory_ratio=0.9, distributed_timeout=60.0, use_dummy_weight=False, use_pynccl=True, max_seq_len_override=None, num_page_override=None, max_extend_tokens=8192, cache_type='radix', offline_mode=False, _unique_suffix='.pid=2657', server_host='127.0.0.1',...
```
Follow the writing style of `BaseOP` and reuse `_concat_prefix` in `OPList`.