KVCached support for gpt-oss-20b attention type not supported in SGLang

Open deepak-vij opened this issue 1 month ago • 5 comments

[2025-10-12 21:14:09] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=163.13 GB, mem usage=14.53 GB.
[kvcached][INFO][2025-10-12 21:14:09][page_allocator.py:152] Init kvcached KV cache allocator: num_layers=24, mem_size_per_layer=179288MB, total_mem_size=8605824MB, page_size=2MB, tp_size=1, async_sched=True, contiguous_layout=True, enable_prealloc=True
[kvcached][WARNING][2025-10-12 21:14:09][interfaces.py:84] kvcached is only tested with page_size=1 for SGLang.
[2025-10-12 21:14:09] Scheduler hit an exception: Traceback (most recent call last):
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2587, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 329, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 71, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 93, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 250, in __init__
    self.initialize(min_per_gpu_memory)
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 378, in initialize
    self.init_memory_pool(
  File "/data04/deepak/kvcached/engine_integration/scripts/sglang-kvcached-venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1464, in init_memory_pool
    self.token_to_kv_pool = MHATokenToKVPool(
                            ^^^^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/kvcached/integration/sglang/patches.py", line 190, in __init__
    self._create_buffers()
  File "/data04/deepak/kvcached/kvcached/integration/sglang/patches.py", line 225, in _create_buffers
    self.k_buffer, self.v_buffer = kvi.alloc_kv_cache(
                                   ^^^^^^^^^^^^^^^^^^^
  File "/data04/deepak/kvcached/kvcached/integration/sglang/interfaces.py", line 118, in alloc_kv_cache
    contiguous_tensor = raw_kv_tensors[0].view(
                        ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[249298944, 24, 2, 8, 64]' is invalid for input of size 95730794496
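
To make the size mismatch in the RuntimeError concrete, here is a small arithmetic check in plain Python. It only uses the two numbers printed in the error message; the variable names are illustrative, and reading the last dimension (64) as the per-head dimension is an assumption, not something the log states.

```python
from math import prod

# Values copied verbatim from the RuntimeError message.
requested_shape = (249298944, 24, 2, 8, 64)
input_numel = 95730794496

# Number of elements the .view() call asks for.
requested_numel = prod(requested_shape)

# How many times larger the requested view is than the raw tensor.
ratio = requested_numel // input_numel

# Check whether the raw tensor size matches the shape with the
# trailing dimension dropped (assumed here to be the head dimension).
matches_without_last_dim = prod(requested_shape[:-1]) == input_numel

print(f"requested elements: {requested_numel}")
print(f"available elements: {input_numel}")
print(f"ratio: {ratio}")
print(f"matches shape without last dim: {matches_without_last_dim}")
```

The requested view needs exactly 64 times more elements than the raw tensor holds, and the raw tensor size equals 249298944 × 24 × 2 × 8. That pattern suggests, but does not prove, that the buffer was sized without the trailing dimension of the shape.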

[2025-10-12 21:14:09] Received sigquit from a child process. It usually means the child failed.
./start_server.sh: line 168: 2765843 Killed $PYTHON -m sglang.launch_server --model "$MODEL" --disable-radix-cache --trust-remote-code --port "$SGL_PORT" --tp "$TP_SIZE" $SGL_L4_ARGS

+ [[ -n ../../engine_integration/scripts/sglang-kvcached-venv ]]
+ deactivate
+ unset -f pydoc
+ '[' -z _ ']'
+ PATH=/data04/deepak/.venv/bin:/data00/code/install/openmpi-5.0.8/install/bin:/data00/code/install/nccl-tests/build:/data00/anaconda3/condabin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda-12.8/bin/:/usr/local/nccl_2.26.2-1+cuda12.8_x86_64/bin/
+ export PATH
+ unset _OLD_VIRTUAL_PATH
+ '[' -z '' ']'
+ hash -r
+ '[' -z _ ']'
+ PS1=
+ export PS1
+ unset _OLD_VIRTUAL_PS1
+ unset VIRTUAL_ENV
+ unset VIRTUAL_ENV_PROMPT
+ '[' '!' '' = nondestructive ']'
+ unset -f deactivate

deepak-vij Oct 28 '25 17:10

Hi @deepak-vij, the attention type needed for gpt-oss isn't supported yet. We are working on it. Stay tuned!

jiarong0907 Oct 28 '25 19:10

@jiarong0907 @ivanium, I look forward to collaborating with you folks on this effort.

deepak-vij Oct 28 '25 20:10

Hi @deepak-vij, after some investigation, we now plan to support vLLM first, which requires adding --disable-hybrid-kv-cache-manager when launching the vLLM server. Feel free to try it out, and to contribute SGLang support if you want. Thanks!

ivanium Nov 20 '25 05:11

@ivanium, I am going to try that right now and let you know. I was on vacation for a while. Thanks.

deepak-vij Dec 01 '25 19:12

@ivanium, it seems SGLang does not support --disable-hybrid-kv-cache-manager.

deepak-vij Dec 01 '25 23:12