Using kvcached + sglang with qwen-fp8 causes an out-of-bounds error. [bug]
I’m currently using an image-based setup to start kvcached + sglang, and I now want to run the Qwen3 FP8 model. The inference server deploys successfully, but as soon as I send a request to sglang, an out-of-bounds error occurs. If I switch the FP8 model back to a BF16 model, everything works fine and no errors appear.
Here is my run command.
python3 -m sglang.launch_server --model /workspace/data/data/Qwen3-8B-FP8/ --disable-radix-cache --port 30000
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
[kvcached][INFO][2025-10-30 07:45:57][patch_base.py:98] Applying 3 patches for sglang
[kvcached][INFO][2025-10-30 07:46:02][version_utils.py:189] Detected sglang version: 0.5.3
[kvcached][INFO][2025-10-30 07:46:02][version_utils.py:189] Detected sglang version: 0.5.3
[kvcached][INFO][2025-10-30 07:46:07][version_utils.py:189] Detected sglang version: 0.5.3
[kvcached][INFO][2025-10-30 07:46:07][patch_base.py:178] Successfully patched sglang: elastic_allocator, elastic_memory_pool, scheduler_memory_leak
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.server_args:
########################################################################
# For contributors and developers:                                     #
# Please move environment variable definitions to sglang.srt.environ   #
# using the following pattern:                                         #
#     SGLANG_XXX = EnvBool(False)                                      #
#                                                                      #
########################################################################
[2025-10-30 07:47:48] INFO: 127.0.0.1:56338 - "POST /generate HTTP/1.1" 200 OK
[2025-10-30 07:47:48] The server is fired up and ready to roll!
[2025-10-30 07:47:57] INFO: 127.0.0.1:54362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-30 07:47:57] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [8,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
My machine has 8 × B200 GPUs, but I only used one B200.
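For reference, the request that triggers the error is essentially the following (a minimal sketch; the prompt and parameters here are placeholders, and any request to the /v1/chat/completions endpoint reproduces it for me):

# Minimal repro sketch. Assumptions: server launched with the command above
# and listening on localhost:30000; prompt and sampling params are arbitrary.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "Qwen3-8B-FP8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.status_code, resp.text)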
@ivanium
B200 is a bit different from H series. Will get back soon.
Thanks for the issue. I edited the log format a little bit for better presentation. This could be related to the FP8 data type. Please use BF16 or FP16 while we investigate.
Thank you very much for your reply. I’m also looking into the kvcached code and would like to contribute a fix for this bug. From my perspective, this project has great potential. However, if formats like FP8 or FP4 can’t be used, loading multiple model weights takes up a large amount of GPU memory and leaves very little room for the KV cache. Moreover, if we intend to load multiple 7B or 8B models, the most cost-effective option would be consumer GPUs like the 4090 or 5090, so supporting quantization is essential. @jiarong0907 @ivanium
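To make the memory point concrete, here is a rough back-of-the-envelope sketch (my own assumptions: about 2 bytes/param for BF16 and 1 byte/param for FP8 on a 24 GB consumer card; real numbers also depend on activations and framework overhead):

# Rough weight-memory estimate for an 8B-parameter model on a 24 GB GPU.
# Assumption: only weight storage is counted; all overheads are ignored.
params = 8e9
gpu_mem_gb = 24  # e.g. an RTX 4090-class card

for name, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    kv_budget_gb = gpu_mem_gb - weights_gb
    print(f"{name}: weights ~{weights_gb:.0f} GB, leaves ~{kv_budget_gb:.0f} GB for KV cache")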
Thanks for digging into the code! We totally agree that quantization is a must. We'd love to collaborate if you are interested in helping with the integration. Please feel free to join the Slack (link available in the README, and here: https://join.slack.com/t/ovg-project/shared_invite/zt-3fr01t8s7-ZtDhHSJQ00hcLHgwKx3Dmw) and let's follow up there to coordinate ideas!
I just joined Slack.
@jiahe7ay Feel free to ping us on Slack so that we can add you to the developer channel.