
Two Qwen3-32B-FP8 instances on an H20 96G GPU using vLLM fail to process requests

Open · shpgy-shpgy opened this issue 1 month ago · 4 comments

When deploying two Qwen3-32B-FP8 instances on a single 96 GB H20 card using vLLM with kvcached, the second instance to launch cannot process requests and simply hangs. With SGLang, however, both instances handle requests normally.

[Screenshot: vLLM engine log showing ~95% KV cache usage on the first instance]

shpgy-shpgy commented Oct 29 '25 08:10

It looks like the first engine has somehow taken most of the GPU memory (95% KV cache usage). In that case, the second engine cannot allocate its KV cache and hence cannot proceed. Are you using the same workload for vLLM and SGLang? It would also be helpful if you could quickly try a smaller workload and attach the log if the issue persists.
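
For reference, a quick way to confirm how much memory is actually left before launching the second engine is to query the device from Python (a minimal sketch; `torch.cuda.mem_get_info()` is a standard PyTorch call, and the interpretation below is mine, not from the thread):

```python
import torch

# Returns (free_bytes, total_bytes) for the current CUDA device.
# If "free" is only a few GB after the first engine starts, the
# second engine will not have room to allocate its KV cache.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.1f} GB / total: {total / 1e9:.1f} GB")
```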

ivanium commented Oct 29 '25 09:10

@ivanium This is the GPU memory usage after launching one instance:

[Screenshot: nvidia-smi after one instance]

And this is the situation after launching two:

[Screenshot: nvidia-smi after two instances]

shpgy-shpgy commented Oct 29 '25 10:10

@shpgy-shpgy Thanks for providing the detailed nvidia-smi information.

Yeah, as I can see from the screenshots, the GPU is almost out of memory. When one instance is processing requests, it consumes additional memory for KV cache and activations, which can leave the second instance without enough memory for its own KV cache to process requests.
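
For intuition, here is a rough back-of-envelope budget (my numbers, not from the thread; it assumes FP8 weights take ~1 byte per parameter and ignores quantization scales, CUDA context, and framework overhead):

```python
# Rough memory budget for two Qwen3-32B-FP8 instances on a 96 GB GPU.
# Assumption: FP8 weights ~ 1 byte/param, so ~32 GB of weights per model.
weights_gb_per_model = 32
gpu_gb = 96

used_by_weights = 2 * weights_gb_per_model  # ~64 GB for weights alone
headroom = gpu_gb - used_by_weights         # ~32 GB left to split across
                                            # KV caches, activations, and
                                            # runtime overhead for BOTH engines
print(f"headroom: ~{headroom} GB")          # -> headroom: ~32 GB
```

With both engines competing for that remaining ~32 GB, it is easy for the first engine's KV cache growth to starve the second.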

Why can SGLang work? In my experience, even when serving the same model, SGLang and vLLM can have different memory footprints. So in this case, SGLang may simply consume a bit less memory than vLLM, leaving it enough room to process requests.
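
One way to give each engine a fixed share is vLLM's standard `gpu_memory_utilization` knob (a sketch only; the 0.45 split is an assumption, not a value from this thread, and the model id is the public Qwen FP8 checkpoint name):

```python
from vllm import LLM

# Cap this instance at ~45% of the GPU so two engines can fit side by side.
# gpu_memory_utilization is a standard vLLM engine argument; the same cap
# can be passed to `vllm serve` as --gpu-memory-utilization.
llm = LLM(
    model="Qwen/Qwen3-32B-FP8",
    gpu_memory_utilization=0.45,
)
```

Launching each instance with its own cap avoids the first engine grabbing nearly all of the free memory before the second starts.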

jiarong0907 commented Oct 29 '25 15:10

@jiarong0907 Got it! Thank you very much for your reply!

shpgy-shpgy commented Oct 30 '25 03:10