[Feature] HiCache JIT kernel (once again)
Motivation
A fixed and improved version of #13453.
Modifications
- Rebase to main.
- Fix compatibility issue: support `gcc >= 11.4.0`. Since `gcc 11.4.0` is widely used and the default in Ubuntu 22.04, I think this should be acceptable.
- For unsupported platforms like `ROCm` or `NPU`, we fall back to the `sgl-kernel` version of the kernel.
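A minimal sketch of the dispatch this implies, assuming a gather/scatter-style transfer entry point. Except for `can_use_hicache_jit_kernel` (which appears in this PR), every name below is hypothetical and used only for illustration:

```python
# Illustrative dispatch only -- names other than can_use_hicache_jit_kernel
# are hypothetical placeholders, not this PR's actual API.
import torch

from sglang.jit_kernel.hicache import can_use_hicache_jit_kernel


def transfer_kv(dst: torch.Tensor, src: torch.Tensor, indices: torch.Tensor) -> None:
    if can_use_hicache_jit_kernel():  # signature assumed; False on ROCm/NPU or old gcc
        # JIT path: compiled on first use, requires CUDA and gcc >= 11.4.0.
        from sglang.jit_kernel.hicache import transfer_kv_jit  # hypothetical
        transfer_kv_jit(dst, src, indices)
    else:
        # Fallback for unsupported platforms: the AOT sgl-kernel build.
        import sgl_kernel
        sgl_kernel.transfer_kv(dst, src, indices)  # hypothetical name
```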
Accuracy Tests
Benchmarking and Profiling
Checklist
- [ ] Format your code according to the Format code with pre-commit.
- [ ] Add unit tests according to the Run and add unit tests.
- [ ] Update documentation according to Write documentations.
- [ ] Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
- [ ] Follow the SGLang code style guidance.
- [ ] Work with maintainers to merge your PR. See the PR Merge Process
Summary of Changes
Hello @DarkSharpness, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request reintroduces and significantly refines the JIT kernel for HiCache operations, aiming to enhance the performance of KV cache data transfers within the system. It addresses previous compatibility concerns by ensuring support for GCC versions 11.4.0 and newer, and implements a robust fallback mechanism for platforms where the JIT kernel is not supported. The changes involve new CUDA C++ kernels for efficient data movement, Python bindings to dynamically compile and utilize these kernels, and comprehensive utilities for tensor validation and warp-level memory access. A new benchmarking script is also included to measure the performance gains of the JIT kernel.
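As background on the "dynamically compile" part: PyTorch can JIT-compile CUDA sources at runtime via `torch.utils.cpp_extension.load_inline`. The toy kernel below illustrates only that general mechanism; it is not this PR's loader (the traceback further down shows the real one lives in `sglang.jit_kernel.utils`):

```python
# Toy illustration of JIT-compiling a CUDA kernel from Python. This is the
# generic torch.utils.cpp_extension mechanism, NOT this PR's actual loader.
import torch
from torch.utils.cpp_extension import load_inline

cpp_src = "torch::Tensor toy_copy(torch::Tensor src);"  # binding declaration

cuda_src = r"""
#include <torch/extension.h>

__global__ void copy_kernel(const float* src, float* dst, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

torch::Tensor toy_copy(torch::Tensor src) {
    TORCH_CHECK(src.is_cuda() && src.scalar_type() == torch::kFloat32);
    auto dst = torch::empty_like(src);
    int64_t n = src.numel();
    int threads = 256;
    int blocks = static_cast<int>((n + threads - 1) / threads);
    copy_kernel<<<blocks, threads>>>(
        src.data_ptr<float>(), dst.data_ptr<float>(), n);
    return dst;
}
"""

# Compilation happens here, with the system toolchain -- this is where the
# gcc >= 11.4.0 requirement discussed above would bite.
mod = load_inline(name="toy_jit_copy", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["toy_copy"])
out = mod.toy_copy(torch.ones(1024, device="cuda"))
assert torch.equal(out, torch.ones(1024, device="cuda"))
```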
Highlights
- Rebased to Main: The pull request has been rebased to the latest main branch, ensuring all changes are up-to-date with the current codebase.
- GCC 11.4.0+ Compatibility: Compatibility issues have been fixed to support GCC versions 11.4.0 and newer, which is a common default in Ubuntu 22.04 environments.
- Platform Fallback Mechanism: A fallback mechanism is implemented where the system will revert to the existing sgl-kernel for platforms not supported by the JIT kernel, such as ROCm or NPU.
- JIT HiCache Kernel Introduction: A new Just-In-Time compiled CUDA kernel for HiCache operations has been introduced, designed to optimize and accelerate KV cache data transfers.
- Performance Benchmarking Script: A new Python script has been added to benchmark the performance and verify the correctness of the JIT HiCache kernel against the reference AOT (Ahead-Of-Time) compiled kernel; a rough harness of this kind is sketched after this list.
- Enhanced Tensor Validation: New C++ utilities for robust tensor shape, stride, data type, and device validation have been added, improving the reliability of kernel argument checking.
- Optimized Warp-Level Memory Operations: CUDA warp-level primitives are included to enable more efficient and coalesced global memory access patterns within the new kernels.
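As promised above, here is a self-contained sketch of such a correctness-plus-timing harness. `jit_transfer` and `aot_transfer` are stand-ins (both implemented here as a plain indexed copy so the script runs as-is), not the PR's real entry points:

```python
# Stand-alone sketch of a correctness + timing harness. The two transfer
# functions are dummy stand-ins; in the real benchmark they would be the
# JIT HiCache kernel and the AOT sgl-kernel path.
import torch

def jit_transfer(dst, src, idx):  # stand-in for the JIT HiCache kernel
    dst[idx] = src[idx]

def aot_transfer(dst, src, idx):  # stand-in for the AOT sgl-kernel path
    dst[idx] = src[idx]

def bench_ms(fn, *args, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)                      # warm-up (also triggers JIT compilation)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

src = torch.randn(4096, 128, device="cuda", dtype=torch.bfloat16)
dst_jit, dst_aot = torch.empty_like(src), torch.empty_like(src)
idx = torch.randperm(4096, device="cuda")

jit_transfer(dst_jit, src, idx)
aot_transfer(dst_aot, src, idx)
assert torch.equal(dst_jit, dst_aot)  # verify correctness before timing
print(f"jit: {bench_ms(jit_transfer, dst_jit, src, idx):.3f} ms/iter, "
      f"aot: {bench_ms(aot_transfer, dst_aot, src, idx):.3f} ms/iter")
```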
Some files were not included in the sglang wheel package during packaging.
Main seems to be broken: it failed on an intra-network H20, but when I ran it on a public-network B200 it passed. Weird.
```bash
SGLANG_VLM_CACHE_SIZE_MB=2048 python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-2B-Thinking --host 0.0.0.0 --port 8188 --trust-remote-code --tp-size 1 --enable-cache-report --log-level info --max-running-requests 64 --mem-fraction-static 0.6 --chunked-prefill-size 8192 --attention-backend flashinfer --mm-attention-back fa3
```
```
[2025-11-24 10:46:04] INFO utils.py:148: Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-11-24 10:46:04] INFO utils.py:151: Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-11-24 10:46:04] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-11-24 10:46:05] ERROR compile_cache.py:130: The following environment variables for compile cache are unset: MODEL_URL, MODEL_DEPLOY_STRATEGY_NAME, OSS_ENDPOINT, OSS_BUCKET, ACCESS_KEY_ID, ACCESS_KEY_SECRET, COMPILE_CACHE_OSS_PREFIX
INFO 11-24 10:46:06 [__init__.py:216] Automatically detected platform cuda.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 33, in <module>
run_server(server_args)
File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 23, in run_server
from sglang.srt.entrypoints.http_server import launch_server
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/entrypoints/http_server.py", line 50, in <module>
from sglang.srt.entrypoints.engine import _launch_subprocesses
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/entrypoints/engine.py", line 47, in <module>
from sglang.srt.managers.data_parallel_controller import (
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/data_parallel_controller.py", line 38, in <module>
from sglang.srt.managers.scheduler import run_scheduler_process
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 48, in <module>
from sglang.srt.disaggregation.decode_kvcache_offload_manager import (
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/disaggregation/decode_kvcache_offload_manager.py", line 10, in <module>
from sglang.srt.managers.cache_controller import HiCacheController
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/cache_controller.py", line 24, in <module>
from sglang.srt.mem_cache.hicache_storage import (
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/mem_cache/hicache_storage.py", line 10, in <module>
from sglang.srt.mem_cache.memory_pool_host import HostKVCache
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/mem_cache/memory_pool_host.py", line 10, in <module>
from sglang.jit_kernel.hicache import can_use_hicache_jit_kernel
File "/opt/conda/lib/python3.10/site-packages/sglang/jit_kernel/hicache.py", line 7, in <module>
from sglang.jit_kernel.utils import load_jit, make_cpp_args
File "/opt/conda/lib/python3.10/site-packages/sglang/jit_kernel/utils.py", line 37, in <module>
KERNEL_PATH = _resolve_kernel_path()
File "/opt/conda/lib/python3.10/site-packages/sglang/jit_kernel/utils.py", line 33, in _resolve_kernel_path
raise RuntimeError("Cannot find sgl-kernel/jit path")
RuntimeError: Cannot find sgl-kernel/jit path
```
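The traceback shows `KERNEL_PATH = _resolve_kernel_path()` executing at module import time, so a packaging gap becomes a hard startup failure. Below is a hedged reconstruction of what such a helper plausibly does; only the error message is taken from the traceback, and the candidate paths are assumptions:

```python
# Hypothetical reconstruction for illustration; only the error message is
# taken from the traceback above. Candidate locations are assumptions.
from pathlib import Path

def _resolve_kernel_path() -> Path:
    here = Path(__file__).resolve().parent
    candidates = (
        here / "include" / "sgl_kernel",         # installed-wheel layout (assumed)
        here.parents[2] / "sgl-kernel" / "jit",  # source-checkout layout (assumed)
    )
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    # If the .h/.cuh sources were never copied into the wheel (the bug fixed
    # by the pyproject.toml change below), nothing matches and any import of
    # the module fails.
    raise RuntimeError("Cannot find sgl-kernel/jit path")

KERNEL_PATH = _resolve_kernel_path()  # runs on import, per the traceback
```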
```diff
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -131,6 +131,9 @@ sglang = "sglang.cli.main:main"
 "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp",
 "srt/speculative/cpp_ngram/*.cpp",
 "srt/speculative/cpp_ngram/*.h",
+"jit_kernel/include/sgl_kernel/*.h",
+"jit_kernel/include/sgl_kernel/*.cuh",
+"jit_kernel/csrc/*.cuh"
 ]
```
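One way to sanity-check the fix after building and installing a wheel is to confirm that the JIT sources actually ship with the package; the paths below mirror the globs added in the diff:

```python
# Quick post-install check that the JIT headers made it into the wheel.
# Uses only stdlib importlib.resources (Python >= 3.9).
import importlib.resources as resources

pkg = resources.files("sglang") / "jit_kernel" / "include" / "sgl_kernel"
headers = [p.name for p in pkg.iterdir() if p.name.endswith((".h", ".cuh"))]
assert headers, "JIT headers missing from the installed sglang package"
print(f"found {len(headers)} JIT header file(s)")
```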
@DarkSharpness Could you please double-check this?