
vLLM model serving server hangs when GPU KV cache usage reaches 10%

Open karanpathak opened this issue 2 years ago • 3 comments

Hello everybody,

The server hangs when the GPU KV cache usage reaches 10%.

Issue in Detail

I attempted to serve the Llama 2 7B Hugging Face model via vLLM on GPU by following the API Server Quickstart guide. Sometimes serving works correctly: the input prompt request is processed and a response is returned. At other times, however, the server accepts the input prompt request but becomes unresponsive and never returns a response.

Upon debugging, I identified that the issue is related to the GPU Key-Value cache usage. The model operates as expected when the GPU Key-Value cache usage is at 0.8%, but the server becomes unresponsive when it reaches 10%.

(Image: working_case) The first image illustrates a successful case where the input prompt request works as expected and the GPU KV cache usage is 0.8% (highlighted in yellow).

(Image: non_working_case) The second image shows an unsuccessful case where the server accepts an input prompt request but becomes unresponsive and provides no response. This occurs when the GPU KV cache usage is at 10% (highlighted in yellow).

Details

  • GPU: NVIDIA Tesla T4 (16 GB VRAM)
  • CUDA Version: 12.0
  • Driver Version: 525.85.12
  • vLLM version: 0.2.1.post1
  • Hugging Face model: meta-llama/Llama-2-7b-hf

Steps to replicate the error

  1. Open a Terminal (terminal 1)
  2. Log in to Hugging Face using its CLI: huggingface-cli login --token <your-token>
  3. In terminal 1, execute the following command: python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf
  4. Open another Terminal (terminal 2)
  5. In terminal 2, execute the following command: curl http://localhost:8000/generate -d '{"prompt": "San Francisco is a", "use_beam_search": true, "n": 4, "temperature": 0}'

Please make sure that the vLLM package and the CUDA toolkit are installed.
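
For reference, the same request can be sent from Python; this is a minimal sketch using the requests library that simply mirrors the curl call in step 5:

# Minimal client sketch mirroring the curl request above.
# Assumes the api_server from step 3 is listening on localhost:8000.
import requests

payload = {
    "prompt": "San Francisco is a",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0,
}

# The server sometimes hangs, so use a timeout to surface the problem
# instead of blocking forever.
response = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
response.raise_for_status()
print(response.json())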

karanpathak avatar Dec 05 '23 04:12 karanpathak

I believe this is similar to #1879. While a T4 can run a 7B model, the throughput will be very low, and vLLM will likely do a lot of eviction and recomputation to cope with the little memory left for computation. Of the 16 GB of device memory, the model itself takes about 14 GB, which leaves very little for computation and the KV cache.
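
A rough back-of-the-envelope estimate (figures approximate, taken from the public Llama-2-7b config rather than measured on this setup) shows how tight it gets:

# Rough KV-cache budget for Llama-2-7b-hf (fp16) on a 16 GB T4.
# Hyperparameters come from the public model config; everything else is an
# estimate, not a measurement from this issue.
GiB = 1024 ** 3

n_params = 6.7e9                       # ~6.7B parameters
weight_bytes = n_params * 2            # fp16 -> 2 bytes per parameter

n_layers, n_kv_heads, head_dim = 32, 32, 128
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16
# = 512 KiB of KV cache for every token kept resident

gpu_bytes = 16 * GiB                   # nominal T4 memory; actually usable is a bit less
usable = gpu_bytes * 0.9               # default --gpu-memory-utilization
kv_budget = usable - weight_bytes      # ignores activation workspace, so optimistic

print(f"weights            ~{weight_bytes / GiB:.1f} GiB")
print(f"KV cache per token ~{kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV-cache budget    ~{kv_budget / GiB:.1f} GiB "
      f"(~{kv_budget / kv_bytes_per_token:,.0f} tokens, at best)")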

simon-mo avatar Dec 05 '23 06:12 simon-mo

Also happens on vLLM 0.2.3, 0.2.4, 0.2.5 and main while running any model tensor-parallel on 2x RTX 4090. Once the GPU KV cache is full, vLLM hangs: it just stops running any processing on the GPU and does not even try to swap anything out to the CPU KV cache.

Last log lines before the freeze:

INFO 12-16 00:57:56 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 84.9%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:01 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 91.5%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:06 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 97.3%, CPU KV cache usage: 0.0%

On killing it with SIGTERM (or Ctrl-C):

^CINFO:     Shutting down
INFO:     Waiting for background tasks to complete. (CTRL+C to force quit)

Then it continues to hang.

Command:

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-34B-Instruct-AWQ \
  --chat-template=$HOME/bin/templates/llama-2-chat.jinja \
  --quantization=awq \
  --dtype=float16 \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.8 \
  --disable-log-requests

The chat template does not matter; it is only there to format prompts correctly for the CodeLlama model.
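
A small probe like the following (my own sketch, not part of vLLM) helps tell a slow server from a hung one: it asks the OpenAI-compatible endpoint started by the command above for a single token and treats a timeout as a stall.

# Probe sketch: request one token from the OpenAI-compatible server above.
# If it cannot answer within the timeout, the engine has almost certainly
# stalled rather than just being slow.
import requests

probe = {
    "model": "model",        # matches --served-model-name=model above
    "prompt": "ping",
    "max_tokens": 1,
    "temperature": 0,
}

try:
    r = requests.post("http://localhost:8000/v1/completions",
                      json=probe, timeout=30)
    r.raise_for_status()
    print("server responsive:", r.json()["choices"][0]["text"])
except requests.exceptions.Timeout:
    print("no reply within 30 s -- engine looks hung, not just slow")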

viktor-ferenczi avatar Dec 16 '23 00:12 viktor-ferenczi

For now we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is reported. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.
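
For the API server this just means passing --swap-space=0 on the launch command. As a rough sketch of the same setting through the offline LLM API (assuming swap_space is accepted as an engine argument; it maps to the --swap-space flag):

# Workaround sketch: run with no CPU swap space, so the scheduler never
# allocates CPU KV-cache blocks and never attempts to swap to them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    swap_space=0,              # same effect as --swap-space=0 on the api_server CLI
)

outputs = llm.generate(["San Francisco is a"],
                       SamplingParams(temperature=0, max_tokens=64))
print(outputs[0].outputs[0].text)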

chi2liu avatar Jan 04 '24 07:01 chi2liu