
Two Qwen3-32B-FP8 instances on an H20 96G GPU using vLLM fail to process requests

Open · shpgy-shpgy opened this issue 1 month ago · 4 comments

When deploying two Qwen3-32B-FP8 instances on a single 96 GB H20 card using vLLM with kvcached, the second instance to launch cannot process requests and simply hangs. With SGLang, however, both instances handle requests normally.

[Screenshot: vLLM engine log showing ~95% KV cache usage on the first instance]

shpgy-shpgy commented Oct 29 '25 08:10

It looks like the first engine has somehow taken most of the GPU memory (95% KV cache usage). In that case, the second engine cannot allocate its KV cache and hence cannot proceed. Are you using the same workload for vLLM and SGLang? It would also be helpful if you could quickly try a smaller workload and attach the log if the issue persists.
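
For reference, a quick way to confirm how much memory is actually left before launching the second engine is to query the device from Python (a minimal sketch; `torch.cuda.mem_get_info()` is a standard PyTorch call, and the interpretation below is mine, not from the thread):

```python
import torch

# Returns (free_bytes, total_bytes) for the current CUDA device.
# If "free" is only a few GB after the first engine starts, the
# second engine will not have room to allocate its KV cache.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.1f} GB / total: {total / 1e9:.1f} GB")
```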

ivanium commented Oct 29 '25 09:10

@ivanium This is the GPU memory usage after launching one instance:

[Screenshot: nvidia-smi after one instance]

And this is the situation after launching two:

[Screenshot: nvidia-smi after two instances]

shpgy-shpgy commented Oct 29 '25 10:10

@shpgy-shpgy Thanks for providing the detailed nvidia-smi information.

Yeah, as I can see from the screenshots, the GPU is almost out of memory. When one instance is processing requests, it consumes additional memory for KV cache and activations, which can leave the second instance without enough memory for its own KV cache to process requests.
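
For intuition, here is a rough back-of-envelope budget (my numbers, not from the thread; it assumes FP8 weights take ~1 byte per parameter and ignores quantization scales, CUDA context, and framework overhead):

```python
# Rough memory budget for two Qwen3-32B-FP8 instances on a 96 GB GPU.
# Assumption: FP8 weights ~ 1 byte/param, so ~32 GB of weights per model.
weights_gb_per_model = 32
gpu_gb = 96

used_by_weights = 2 * weights_gb_per_model  # ~64 GB for weights alone
headroom = gpu_gb - used_by_weights         # ~32 GB left to split across
                                            # KV caches, activations, and
                                            # runtime overhead for BOTH engines
print(f"headroom: ~{headroom} GB")          # -> headroom: ~32 GB
```

With both engines competing for that remaining ~32 GB, it is easy for the first engine's KV cache growth to starve the second.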

Why can SGLang work? In my experience, even when serving the same model, SGLang and vLLM can have different memory footprints. So in this case, SGLang may simply consume a bit less memory than vLLM, leaving it enough room to process requests.
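
One way to give each engine a fixed share is vLLM's standard `gpu_memory_utilization` knob (a sketch only; the 0.45 split is an assumption, not a value from this thread, and the model id is the public Qwen FP8 checkpoint name):

```python
from vllm import LLM

# Cap this instance at ~45% of the GPU so two engines can fit side by side.
# gpu_memory_utilization is a standard vLLM engine argument; the same cap
# can be passed to `vllm serve` as --gpu-memory-utilization.
llm = LLM(
    model="Qwen/Qwen3-32B-FP8",
    gpu_memory_utilization=0.45,
)
```

Launching each instance with its own cap avoids the first engine grabbing nearly all of the free memory before the second starts.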

jiarong0907 commented Oct 29 '25 15:10

@jiarong0907 Got it! Thank you very much for your reply!

shpgy-shpgy commented Oct 30 '25 03:10