
KV cache is low, memory profiling does not see the remaining VRAM

Open viktor-ferenczi opened this issue 1 year ago • 10 comments

GPUs: 2x 4090 (2x24GB)

Regarding my long context issue with CodeLlama above:

  • vLLM 0.2.3: # GPU blocks: 1464, # CPU blocks: 1310
  • vLLM 0.2.4: # GPU blocks: 1464, # CPU blocks: 1310
  • vLLM 0.2.5 and main: # GPU blocks: 112, # CPU blocks: 1310

Something broke in VRAM profiling, or before it, that prevents vLLM from using all of the remaining VRAM for the KV cache. Profiling already returns values that are too low, and there is no way to override them manually from the command line. Both GPUs had ~8GB of free VRAM after loading the model, so vLLM simply fails to allocate it as cache.
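
As a rough illustration of why ~8GB of leftover VRAM should translate into far more than 112 blocks, here is a back-of-the-envelope sketch of the sizing logic. All names and numbers below are my own assumptions, not vLLM's actual profiling code:

# Back-of-the-envelope sketch of how a vLLM-style profiling pass sizes the KV cache.
# Helper names and numbers are illustrative assumptions, not vLLM internals; the point
# is only that the block count should track the VRAM left over after weights and
# peak activations.

def estimate_gpu_blocks(total_vram_bytes: int,
                        weight_bytes: int,
                        peak_activation_bytes: int,
                        gpu_memory_utilization: float,
                        bytes_per_block: int) -> int:
    """KV-cache blocks that fit in the budgeted VRAM after weights and activations."""
    budget = int(total_vram_bytes * gpu_memory_utilization)
    free_for_cache = budget - weight_bytes - peak_activation_bytes
    return max(free_for_cache // bytes_per_block, 0)

GiB = 1024 ** 3
# CodeLlama-13B fp16 across 2x24GB with tensor parallelism (assumed figures):
blocks = estimate_gpu_blocks(
    total_vram_bytes=2 * 24 * GiB,
    weight_bytes=26 * GiB,           # ~13B params * 2 bytes (fp16)
    peak_activation_bytes=4 * GiB,   # assumed profiling result
    gpu_memory_utilization=0.95,
    # block_size(16) * 2 (K+V) * 40 layers * 40 heads * 128 head_dim * 2 bytes
    bytes_per_block=16 * 2 * 40 * 40 * 128 * 2,
)
print(blocks)  # on the order of a thousand blocks, nowhere near 112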

Command:

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-13B-Instruct-fp16 \
  --chat-template=$HOME/bin/templates/llama-2-chat.jinja \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95 \
  --disable-log-requests

Tested OK up to the full 16k context window on vLLM 0.2.3 and 0.2.4. The test fails on 0.2.5 if the sequence is longer than about 1700 tokens. (I think the exact limit is 112 * 16 = 1792 tokens, given the block manager allocation and the block size of 16.)
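
A quick sanity check of that arithmetic (16 tokens per block is vLLM's default block size):

# With only 112 GPU blocks and 16 tokens per block, the KV cache tops out at:
gpu_blocks = 112
block_size = 16
print(gpu_blocks * block_size)  # 1792 tokens, matching the ~1700-token failure point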

vLLM 0.2.5 (and main) works fine with TheBloke/deepseek-coder-33B-instruct-AWQ; the problem does not happen with that model.

The use of --chat-template does not affect the problem; it is only there to get the chat template right (same as Llama-2).

I've tried changing all of the relevant command-line options in many ways; none of them helped.

viktor-ferenczi avatar Dec 15 '23 22:12 viktor-ferenczi

Hi @viktor-ferenczi, thanks for reporting the issue. I believe this was fixed by #2151. Please try out v0.2.6!

WoosukKwon avatar Dec 18 '23 00:12 WoosukKwon

Thanks for the fix. However, testing vLLM 0.2.6 with the same command failed right away.

To get it to run inference at all, I had to add the --enforce-eager parameter.

It could then load and run inference, but with about 34% lower total throughput than before (~500 t/s => ~330 t/s).

It starts with more GPU KV cache blocks, which is good:

# GPU blocks: 2751, # CPU blocks: 4228

But it crashed without any meaningful error in the log at ~69% GPU KV cache usage:

INFO 12-18 10:49:47 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 315.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 68.9%, CPU KV cache usage: 0.0%

What is the best way to provide you with more details?

I had to revert to vLLM 0.2.5 and restart it whenever it freezes (keepalive).

viktor-ferenczi avatar Dec 18 '23 10:12 viktor-ferenczi

Hi @viktor-ferenczi, could you provide a reproducible script?

WoosukKwon avatar Dec 18 '23 18:12 WoosukKwon

Sure. I will attempt to crash it with 16 parallel long-context (near-16k) completions.
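
Something roughly like this minimal sketch, assuming the openai>=1.0 Python client and the server started with the command above; the prompt contents and lengths are placeholders:

# Minimal load-test sketch: fire 16 concurrent long-context completions at the
# OpenAI-compatible vLLM server. Adjust the filler so each request approaches
# the 16k context window.
import concurrent.futures

from openai import OpenAI  # assumes the openai>=1.0 client package

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

LONG_PROMPT = "def fibonacci(n):\n" * 2000  # crude filler to get a long prompt

def run_one(i: int) -> int:
    resp = client.chat.completions.create(
        model="model",  # matches --served-model-name=model
        messages=[{"role": "user", "content": LONG_PROMPT}],
        max_tokens=512,
    )
    return len(resp.choices[0].message.content or "")

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for i, length in enumerate(pool.map(run_one, range(16))):
        print(f"request {i}: got {length} chars back")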

viktor-ferenczi avatar Dec 19 '23 07:12 viktor-ferenczi

Should I test 0.2.6 or the main branch?

viktor-ferenczi avatar Dec 19 '23 07:12 viktor-ferenczi

@WoosukKwon the vLLM server gets stuck indefinitely: when tried with Mistral 7B Instruct it keeps running those 2 requests forever and stops responding to new requests. Is this a known issue? We need to restart it before it starts responding again. (Note: the request prompts were large.)

20/12/2023, 12:10:11 AM INFO 12-19 18:40:11 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%

Rahmat711 avatar Dec 20 '23 16:12 Rahmat711

@Rahmat711 Which vLLM version are you using?

viktor-ferenczi avatar Dec 20 '23 20:12 viktor-ferenczi

@viktor-ferenczi I am using version 0.2.6.

Rahmat711 avatar Dec 20 '23 23:12 Rahmat711

@WoosukKwon Tested the main branch at commit a1b9cb2a:

GPUs: 2x4090 (2x24GB)

CodeLlama-13B-Instruct-fp16

Loads OK: # GPU blocks: 1159, # CPU blocks: 1310. Works up to the 16k context window, but GPU KV cache usage reaches ~82% in my test.

TheBloke/CodeLlama-34B-Instruct-AWQ

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-34B-Instruct-AWQ \
  --chat-template=chat-templates/llama-2-chat.jinja \
  --quantization=awq \
  --dtype=float16 \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --disable-log-requests \
  --enforce-eager \
  --gpu-memory-utilization=0.95

Crashes during model load without any meaningful error in the log; it just dies there:

INFO 12-22 23:54:27 api_server.py:727] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='model', chat_template='/home/viktor/bin/chat-templates/llama-2-chat.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='/home/viktor/models/TheBloke/CodeLlama-34B-Instruct-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='float16', max_model_len=16384, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=8, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=16, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=True, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
WARNING 12-22 23:54:27 config.py:463] Casting torch.bfloat16 to torch.float16.
WARNING 12-22 23:54:27 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2023-12-22 23:54:29,255 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-22 23:54:29 llm_engine.py:74] Initializing an LLM engine with config: model='/home/viktor/models/TheBloke/CodeLlama-34B-Instruct-AWQ', tokenizer='/home/viktor/models/TheBloke/CodeLlama-34B-Instruct-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=True, seed=0)
  • 0.2.4 could load this model (exact same command without the eager parameter) and it works fine with: # GPU blocks: 3924, # CPU blocks: 5461
  • 0.2.5 could load this model with --gpu-memory-utilization=0.95, but ended up with only minimal # GPU blocks: 112, so contexts longer than 1700 tokens fail
  • 0.2.6 and main branch a1b9cb2a version cannot even load this model anymore, at least for me

Something broke in 0.2.5 and got worse in 0.2.6.

How can I debug where it crashes? I guess I could enable more verbose logging or step through it in a debugger until it crashes, then narrow it down. Is there an easier technique?

viktor-ferenczi avatar Dec 22 '23 22:12 viktor-ferenczi

For now, we have found a workaround: set the swap space directly to 0. That way vLLM never uses the CPU swap space and no errors are reported. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.
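
For completeness, the same workaround expressed through vLLM's offline Python API (a sketch; the model and other values are just examples, swap_space=0 is the relevant part):

# Offline-API equivalent of the workaround: disable CPU swap entirely so the
# scheduler never allocates CPU blocks. All values other than swap_space are examples.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/CodeLlama-13B-Instruct-fp16",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    swap_space=0,  # the workaround: no CPU KV cache blocks
)
outputs = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)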

chi2liu avatar Jan 04 '24 07:01 chi2liu

GPU KV cache

Hello, can you show me how to increase the GPU KV cache? Here is my log. When I run gemma-7b, a single request takes about 30 s, which is very slow. How can I make it faster?

INFO 05-22 02:20:59 async_llm_engine.py:524] Received request 10390bd4d2dc4936bd1a62e5793a4fd8: prompt: '<start_of_turn>user\nplease tell me something about python<end_of_turn>\n<start_of_turn>model\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=0.8, top_k=1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[1], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 05-22 02:20:59 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:04 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:09 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:14 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:19 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:24 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:29 async_llm_engine.py:120] Finished request 10390bd4d2dc4936bd1a62e5793a4fd8.

adogwangwang avatar May 22 '24 02:05 adogwangwang