
vLLM model serving server hangs when GPU KV cache usage reaches 10%

Open karanpathak opened this issue 2 years ago • 3 comments

Hello everybody,

The server hangs when the GPU KV cache usage reaches 10%.

Issue in Detail

I attempted to serve the Llama 2 7B Hugging Face model via vLLM on GPU by following the API Server Quickstart guide. Sometimes serving works correctly: the input prompt request is processed and a response is returned. At other times, however, the server accepts the input prompt request but becomes unresponsive and never returns a response.

Upon debugging, I identified that the issue is related to the GPU Key-Value cache usage. The model operates as expected when the GPU Key-Value cache usage is at 0.8%, but the server becomes unresponsive when it reaches 10%.

(Image: working_case) The first image illustrates a successful case where the input prompt request works as expected and the GPU KV cache usage is 0.8% (highlighted in yellow).

(Image: non_working_case) The second image shows an unsuccessful case where the server accepts an input prompt request but becomes unresponsive and provides no response. This occurs when the GPU KV cache usage is at 10% (highlighted in yellow).

Details

  • GPU: NVIDIA Tesla T4 (16 GB VRAM)
  • CUDA Version: 12.0
  • Driver Version: 525.85.12
  • vLLM version: 0.2.1.post1
  • Hugging Face model: meta-llama/Llama-2-7b-hf

Steps to replicate the error

  1. Open a Terminal (terminal 1)
  2. Log in to Hugging Face using its CLI: huggingface-cli login --token <your-token>
  3. In terminal 1, execute the following command: python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf
  4. Open another Terminal (terminal 2)
  5. In terminal 2, execute the following command: curl http://localhost:8000/generate -d '{"prompt": "San Francisco is a", "use_beam_search": true, "n": 4, "temperature": 0}'

Please make sure that the vLLM package and the CUDA toolkit are installed.
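
For reference, the same request can be sent from Python; this is a minimal sketch using the requests library that simply mirrors the curl call in step 5:

# Minimal client sketch mirroring the curl request above.
# Assumes the api_server from step 3 is listening on localhost:8000.
import requests

payload = {
    "prompt": "San Francisco is a",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0,
}

# The server sometimes hangs, so use a timeout to surface the problem
# instead of blocking forever.
response = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
response.raise_for_status()
print(response.json())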

karanpathak avatar Dec 05 '23 04:12 karanpathak

I believe this is similar to #1879. While a T4 can run a 7B model, the throughput will be very low, and vLLM will likely do a lot of eviction and recomputation to cope with the little memory left for computation. Of the 16 GB of device memory, the model itself takes about 14 GB, which leaves very little for computation and the KV cache.
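
A rough back-of-the-envelope estimate (figures approximate, taken from the public Llama-2-7b config rather than measured on this setup) shows how tight it gets:

# Rough KV-cache budget for Llama-2-7b-hf (fp16) on a 16 GB T4.
# Hyperparameters come from the public model config; everything else is an
# estimate, not a measurement from this issue.
GiB = 1024 ** 3

n_params = 6.7e9                       # ~6.7B parameters
weight_bytes = n_params * 2            # fp16 -> 2 bytes per parameter

n_layers, n_kv_heads, head_dim = 32, 32, 128
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16
# = 512 KiB of KV cache for every token kept resident

gpu_bytes = 16 * GiB                   # nominal T4 memory; actually usable is a bit less
usable = gpu_bytes * 0.9               # default --gpu-memory-utilization
kv_budget = usable - weight_bytes      # ignores activation workspace, so optimistic

print(f"weights            ~{weight_bytes / GiB:.1f} GiB")
print(f"KV cache per token ~{kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV-cache budget    ~{kv_budget / GiB:.1f} GiB "
      f"(~{kv_budget / kv_bytes_per_token:,.0f} tokens, at best)")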

simon-mo avatar Dec 05 '23 06:12 simon-mo

Also happens on vLLM 0.2.3, 0.2.4, 0.2.5 and main while running any model tensor-parallel on 2x RTX 4090. Once the GPU KV cache is full, vLLM hangs: it just stops running any processing on the GPU and does not even try to swap anything out to the CPU KV cache.

Last log lines before the freeze:

INFO 12-16 00:57:56 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 84.9%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:01 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 91.5%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:06 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 97.3%, CPU KV cache usage: 0.0%

On killing it with SIGTERM (or Ctrl-C):

^CINFO:     Shutting down
INFO:     Waiting for background tasks to complete. (CTRL+C to force quit)

Then it continues to hang.

Command:

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-34B-Instruct-AWQ \
  --chat-template=$HOME/bin/templates/llama-2-chat.jinja \
  --quantization=awq \
  --dtype=float16 \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.8 \
  --disable-log-requests

The chat template does not matter; it is only there to format prompts correctly for the CodeLlama model.
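
A small probe like the following (my own sketch, not part of vLLM) helps tell a slow server from a hung one: it asks the OpenAI-compatible endpoint started by the command above for a single token and treats a timeout as a stall.

# Probe sketch: request one token from the OpenAI-compatible server above.
# If it cannot answer within the timeout, the engine has almost certainly
# stalled rather than just being slow.
import requests

probe = {
    "model": "model",        # matches --served-model-name=model above
    "prompt": "ping",
    "max_tokens": 1,
    "temperature": 0,
}

try:
    r = requests.post("http://localhost:8000/v1/completions",
                      json=probe, timeout=30)
    r.raise_for_status()
    print("server responsive:", r.json()["choices"][0]["text"])
except requests.exceptions.Timeout:
    print("no reply within 30 s -- engine looks hung, not just slow")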

viktor-ferenczi avatar Dec 16 '23 00:12 viktor-ferenczi

For now we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is reported. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.
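
For the API server this just means passing --swap-space=0 on the launch command. As a rough sketch of the same setting through the offline LLM API (assuming swap_space is accepted as an engine argument; it maps to the --swap-space flag):

# Workaround sketch: run with no CPU swap space, so the scheduler never
# allocates CPU KV-cache blocks and never attempts to swap to them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    swap_space=0,              # same effect as --swap-space=0 on the api_server CLI
)

outputs = llm.generate(["San Francisco is a"],
                       SamplingParams(temperature=0, max_tokens=64))
print(outputs[0].outputs[0].text)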

chi2liu avatar Jan 04 '24 07:01 chi2liu