kvcached: gpt-oss-20b attention type not supported in SGLang
[2025-10-12 21:14:09] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=163.13 GB, mem usage=14.53 GB.
[kvcached][INFO][2025-10-12 21:14:09][page_allocator.py:152] Init kvcached KV cache allocator: num_layers=24, mem_size_per_layer=179288MB, total_mem_size=8605824MB, page_size=2MB, tp_size=1, async_sched=True, contiguous_layout=True, enable_prealloc=True
[kvcached][WARNING][2025-10-12 21:14:09][interfaces.py:84] kvcached is only tested with page_size=1 for SGLang.
[2025-10-12 21:14:09] Scheduler hit an exception: Traceback (most recent call last):
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2587, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 329, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 71, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 93, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 250, in __init__
    self.initialize(min_per_gpu_memory)
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 378, in initialize
    self.init_memory_pool(
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1464, in init_memory_pool
    self.token_to_kv_pool = MHATokenToKVPool(
                            ^^^^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/kvcached/integration/sglang/patches.py", line 190, in __init__
    self._create_buffers()
  File "/data04/deepak/kvcached/kvcached/integration/sglang/patches.py", line 225, in _create_buffers
    self.k_buffer, self.v_buffer = kvi.alloc_kv_cache(
                                   ^^^^^^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/kvcached/integration/sglang/interfaces.py", line 118, in alloc_kv_cache
    contiguous_tensor = raw_kv_tensors[0].view(
                        ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[249298944, 24, 2, 8, 64]' is invalid for input of size 95730794496
[2025-10-12 21:14:09] Received sigquit from a child process. It usually means the child failed.
./start_server.sh: line 168: 2765843 Killed $PYTHON -m sglang.launch_server --model "$MODEL" --disable-radix-cache --trust-remote-code --port "$SGL_PORT" --tp "$TP_SIZE" $SGL_L4_ARGS
- [[ -n ../../engine_integration/scripts/sglang-kvcached-venv ]]
- deactivate
- unset -f pydoc
- '[' -z _ ']'
- PATH=/data04/deepak/.venv/bin:/data00/code/install/openmpi-5.0.8/install/bin:/data00/code/install/nccl-tests/build:/data00/anaconda3/condabin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda-12.8/bin/:/usr/local/nccl_2.26.2-1+cuda12.8_x86_64/bin/
- export PATH
- unset _OLD_VIRTUAL_PATH
- '[' -z '' ']'
- hash -r
- '[' -z _ ']'
- PS1=
- export PS1
- unset _OLD_VIRTUAL_PS1
- unset VIRTUAL_ENV
- unset VIRTUAL_ENV_PROMPT
- '[' '!' '' = nondestructive ']'
- unset -f deactivate
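For anyone reading the traceback, the size mismatch in the `RuntimeError` can be checked with a few lines of arithmetic. This is only a sketch using the numbers printed in the log. The dimension labels in the comments (tokens, layers, K/V, KV heads, head dim) are my assumption about the layout, not something the log confirms.

```python
import math

# Numbers copied verbatim from the RuntimeError in the log.
requested_shape = (249298944, 24, 2, 8, 64)
available_elements = 95730794496

# Assumed meaning of the dimensions (not confirmed by the log):
# (tokens, layers, K/V, kv_heads, head_dim)
requested_elements = math.prod(requested_shape)
ratio = requested_elements // available_elements

# Elements one entry of the first dimension needs across the other dimensions.
per_token_elements = math.prod(requested_shape[1:])
max_first_dim = available_elements // per_token_elements

print(f"requested elements: {requested_elements}")
print(f"available elements: {available_elements}")
print(f"requested / available: {ratio}")
print(f"largest first dimension that would fit: {max_first_dim}")
```

The requested view needs exactly 64 times as many elements as the buffer holds, and 64 is also the size of the last dimension. That might point to a unit mismatch in how the first dimension is computed, but nothing in this thread confirms the cause.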
Hi @deepak-vij, the attention type needed for gpt-oss isn't supported yet. We are working on it. Stay tuned!
@jiarong0907 @ivanium, I look forward to collaborating with you folks on this effort.
Hi @deepak-vij, after some investigation, we now plan to support vLLM first, which works by adding --disable-hybrid-kv-cache-manager when launching the vLLM server. Feel free to try it out, and you are welcome to contribute SGLang support if you want. Thanks!
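A minimal sketch of what such a launch could look like, written as a small Python helper that assembles the command line. The model identifier `openai/gpt-oss-20b`, the port, and the helper name `build_vllm_command` are illustrative assumptions. Any kvcached-specific environment setup is not shown here and should be taken from the kvcached documentation.

```python
import shlex


def build_vllm_command(model: str, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` command with the hybrid KV cache manager disabled.

    Hypothetical helper for illustration; only the flag itself comes from this thread.
    """
    return [
        "vllm",
        "serve",
        model,
        "--port",
        str(port),
        "--disable-hybrid-kv-cache-manager",
    ]


# Model identifier and port are assumptions for this example.
command = build_vllm_command("openai/gpt-oss-20b", port=8000)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command)` once the environment is set up.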
@ivanium, I am going to do that right now and let you know. I was on vacation for a while. Thanks.
@ivanium, it seems SGLang does not support --disable-hybrid-kv-cache-manager.