633WHU
At present we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never called and it will not report any...
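For anyone who wants to try the same workaround, here is a minimal sketch, assuming vLLM's standard `swap_space` engine argument (the model name below is just the one from this thread, and the prompt is only illustrative):

```python
# Minimal sketch: disable CPU swap space entirely so vLLM never allocates
# or touches CPU swap blocks. `swap_space` is specified in GiB.
from vllm import LLM, SamplingParams

llm = LLM(
    model="NousResearch/Llama-2-7b-chat-hf",  # example model from this thread
    swap_space=0,                              # 0 GiB -> no CPU swap space
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same thing should be achievable when launching the OpenAI-compatible server by passing `--swap-space 0` on the command line.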
> Also encountered the same problem. How can it be solved? @viktor-ferenczi @WoosukKwon

At present we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never called and it will not report any...
> ### **Bug Description**
> After running and testing vLLM successfully with [NousResearch/Llama-2-7b-chat-hf](https://huggingface.co/NousResearch/Llama-2-7b-chat-hf) and [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ), I changed the LLM to [vilm/vinallama-2.7b-chat](https://huggingface.co/vilm/vinallama-2.7b-chat), a Llama-2-family model. This time the API server...
> I am facing the same problem. In our case we do not use multiple workers, but in K8s each pod is a worker running the server. Due to autoscaling...
Set your `streamable_http_path` to `'/mcp/'`.
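If this refers to the MCP Python SDK's FastMCP server, a minimal sketch of where that setting would go looks roughly like the following; the server name and the example tool are illustrative assumptions, not taken from this thread:

```python
# Minimal sketch, assuming the MCP Python SDK's FastMCP server.
from mcp.server.fastmcp import FastMCP

# Serve the Streamable HTTP endpoint under /mcp/ instead of the default path.
mcp = FastMCP("demo-server", streamable_http_path="/mcp/")

@mcp.tool()
def ping() -> str:
    """Placeholder tool so the server exposes something."""
    return "pong"

if __name__ == "__main__":
    # Run over the Streamable HTTP transport at the configured path.
    mcp.run(transport="streamable-http")
```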