
GPU memory utilization error with AWQ model when tp > 1.

Open · gesanqiu opened this issue 1 year ago · 6 comments

By default vLLM allocates 90% of the GPU memory on each accessible GPU card, but when the server is launched with an AWQ model the behavior becomes unpredictable. I ran an AWQ-format codellama-13b (6.8 GB) model on an L4 (24 GB) and an A40 (48 GB), and the GPU memory utilization differs between them; the --gpu-memory-utilization and --swap-space parameters have no effect in this situation. On the L4 (24 GB) it pre-allocates ~10 GB of GPU memory, and when I send a request it grows by ~2 GB, bringing usage to ~12 GB.

On the A40 (48 GB) it pre-allocates ~31 GB of GPU memory, and when I send the same request that I sent to the L4, usage again grows by ~2 GB, to ~33 GB.
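Here is a minimal offline sketch of the setup above. I actually launch the API server; the checkpoint path here is just a placeholder AWQ build of codellama-13b, and the parameters mirror the CLI flags mentioned above:

```python
# Minimal sketch of the configuration described above, using vLLM's offline LLM API
# instead of the API server. The model path and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/CodeLlama-13B-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.9,  # same default the server uses
    swap_space=4,                # GiB of CPU swap, mirrors --swap-space
)

outputs = llm.generate(["def fib(n):"], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```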

The only pattern I can find is that both allocate 1946 MB of GPU memory after I send a request to the server. Normally vLLM does not allocate additional memory after the engine is launched, which is weird. The shortage of GPU memory causes two problems:

  1. Only a limited number of sequences can be processed in each iteration; on the L4 the llm_engine has only one sequence in the running queue.
  2. When max_new_token is large, the KV cache fills up and the llm_engine gets stuck; see also #1206.

However, I can't reproduce this with the llama-2-13b-hf AWQ model. @WoosukKwon @zhuohan123 Do you have any ideas about this issue?

gesanqiu · Oct 25 '23 12:10

After tracing the whole GPU memory allocation flow, I found that vLLM runs a profiling forward pass before initializing the cache. Because there is no true INT4 kernel, vLLM de-quantizes the weights to FP16, so the peak GPU memory is larger than intended, and this is what divides down the cache size: num_gpu_blocks = int((total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size). Codellama-13b has a 16K context length, and the AWQ-format model costs ~10 GB of extra GPU memory to execute one forward pass. So on the L4 (24 GB), only about 23*0.9 - 8 - 10 = ~2.7 GB of GPU memory is left for the CacheEngine. For now it is better to estimate the memory cost in FP16 when deploying a quantized model. But vLLM still gets stuck when the KV cache is fully loaded, which needs a solution.
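To make the numbers concrete, here is a rough sketch of the block calculation for the L4 case. The model shape for codellama-13b (40 layers, 40 KV heads of head size 128, no GQA) and vLLM's default block size of 16 are my assumptions:

```python
# Rough back-of-the-envelope version of the formula above for the L4 case.
# Assumed model shape for codellama-13b: 40 layers, 40 KV heads, head size 128.
GiB = 1024 ** 3

total_gpu_memory = 23 * GiB          # usable memory reported on the 24 GB L4
gpu_memory_utilization = 0.9
peak_memory = (8 + 10) * GiB         # ~8 GB weights/activations + ~10 GB dequantization overhead

# KV cache block: 2 (K and V) * block_size * num_heads * head_size * num_layers * 2 bytes (FP16)
block_size, num_layers, num_heads, head_size = 16, 40, 40, 128
cache_block_size = 2 * block_size * num_heads * head_size * num_layers * 2  # ~12.5 MiB

num_gpu_blocks = int((total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size)
print(num_gpu_blocks, "blocks ->", num_gpu_blocks * block_size, "cacheable tokens")
# ~220 blocks, i.e. only ~3.5K tokens of KV cache for a 16K-context model
```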

gesanqiu · Oct 26 '23 07:10

How do you calculate that peak_memory is 8 GB?

LovesportsMcDull · Oct 27 '23 07:10

Hi @gesanqiu, there seems to be a big difference between vLLM's quant CUDA kernels and llm-awq's, especially around dequantization. Also, a few users reported that the FastChat AWQ implementation works just fine (they use tinychat, which uses llm-awq to load the quantized models). I have no experience with kernels, but could this be fixed by using the llm-awq kernels? Or even by letting vLLM use tinychat to load quantized AWQ models?

roelschr · Nov 01 '23 13:11

After more testing, I found that the AWQ model still misbehaves when tp > 1. With tp=2, it hits an OOM error once concurrency is high enough; it seems --max-num-batched-tokens or init_cache() doesn't work well in the tp > 1 situation. Even if I set --gpu-memory-utilization 0.5, it still uses all of my GPU memory to process as many requests as it can.
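For reference, this is roughly the engine configuration that hits the OOM. It is only a sketch: the model path is a placeholder and the fields simply mirror the CLI flags mentioned above:

```python
# Sketch of the engine configuration that triggers the tp=2 OOM described above.
# The model path is a placeholder; the fields mirror the CLI flags.
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

engine_args = EngineArgs(
    model="TheBloke/CodeLlama-13B-AWQ",   # placeholder AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,               # --tensor-parallel-size 2
    gpu_memory_utilization=0.5,           # --gpu-memory-utilization 0.5 (still OOMs under load)
    max_num_batched_tokens=4096,          # --max-num-batched-tokens
)
engine = LLMEngine.from_engine_args(engine_args)  # OOM shows up once many requests are queued
```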

gesanqiu · Nov 02 '23 10:11

> Hi @gesanqiu, there seems to be a big difference between vLLM's quant CUDA kernels and llm-awq's, especially around dequantization. Also, a few users reported that the FastChat AWQ implementation works just fine (they use tinychat, which uses llm-awq to load the quantized models). I have no experience with kernels, but could this be fixed by using the llm-awq kernels? Or even by letting vLLM use tinychat to load quantized AWQ models?

Sorry, I'm also not a CUDA expert, so I can't help with your idea.

gesanqiu · Nov 02 '23 10:11

For now we have found a workaround: set the swap space directly to 0. That way the CPU swap space is never used and no error is reported. However, the number of CPU blocks also becomes 0, which may slow things down a bit, but at least it doesn't hang and die.
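As a sketch (placeholder model path), the workaround is simply passing swap_space=0, i.e. --swap-space 0 on the server:

```python
# Sketch of the workaround: disable CPU swap space entirely (equivalent to --swap-space 0).
# The model path is a placeholder.
from vllm import LLM

llm = LLM(
    model="TheBloke/CodeLlama-13B-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
    swap_space=0,   # no CPU KV-cache blocks; avoids the hang at the cost of no swapping
)
```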

chi2liu · Jan 04 '24 07:01