Using nvfp4 + kvcached + sglang results in a type mismatch error

Open jiahe7ay opened this issue 1 month ago • 0 comments

When I use nvfp4 + kvcached + sglang, the following error occurs:

  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/base_attn_backend.py", line 91, in forward
    return self.forward_decode(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 815, in forward_decode
    forward_batch.token_to_kv_pool.set_kv_buffer(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/mem_cache/memory_pool.py", line 823, in set_kv_buffer
    self.k_buffer[layer_id - self.start_layer][loc] = cache_k
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
RuntimeError: Index put requires the source and destination dtypes match, got Float8_e4m3fn for the destination and Byte for the source.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 311, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 322, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 479, in initialize
    self.init_device_graphs()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 1995, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 383, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Index put requires the source and destination dtypes match, got Float8_e4m3fn for the destination and Byte for the source.
Possible solutions:

set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
set --cuda-graph-max-bs to a smaller value (e.g., 16)
disable torch compile by not using --enable-torch-compile
disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

However, running sglang in its native mode (without kvcached) together with nvfp4 produces no errors, and the output is correct.

My guess is that this happens because kvcached determines the KV cache dtype in its csrc code. nvfp4 uses a somewhat special type: in the error above, the destination buffer is Float8_e4m3fn while the data being written is Byte (uint8). Handling this case may require additional changes in the csrc code.

Nov 10 '25 11:11 jiahe7ay