Loading an AWQ-quantized model causes OOM
Examples:
- https://huggingface.co/TheBloke/CodeLlama-7B-AWQ: physical size is 4 GB, but it uses about 20 GB of VRAM.
- https://huggingface.co/TheBloke/deepseek-coder-33B-instruct-AWQ: physical size is 17 GB, but it cannot run on a dual-A100 (40 GB) server even with `--tensor-parallel-size 2` configured.
Why is that? Is it related to the AWQ quantization process or to my vLLM usage?
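For reference, this is roughly how I'm loading it (a minimal sketch using the offline `LLM` API; the exact script, prompt, and sampling parameters may differ):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint across both A100s (this is where the OOM shows up).
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    dtype="half",              # AWQ kernels run in fp16
    tensor_parallel_size=2,    # same as --tensor-parallel-size 2
)

outputs = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```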
@TheBloke Could you please help me with this?
- This is normal and good - vLLM always uses nearly 100% VRAM, using the extra for caching.
- Sorry, I haven't tested AWQ with tensor parallelism on vLLM, so I have no experience there. You should definitely be able to run it on a single GPU though, even if TP doesn't work with AWQ.
- However, it doesn't just use the extra VRAM for caching; it goes beyond what's available, so OOM occurs :(
- I tried on 1 GPU:
  - Success on Phind-Codellama-34B-V2-AWQ (max seq len is 16384 by default): physical size is 18 GB, and it ends up using about 21 GB of VRAM.
  - Failure on deepseek-coder-33B-instruct-AWQ (max seq len is 65536 by default). If I limit it with `--max-model-len 16384` it runs fine (see the sketch after this list), but `--max-model-len 32768` errors again; I don't know where the edge is.

Any ideas? I only have a 2×A100 (40 GB) server.
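The working single-GPU setup, as a sketch (the kwargs mirror the CLI flags above; everything else is left at defaults):

```python
from vllm import LLM

# Single-GPU load of the 33B AWQ model; capping the context length keeps the
# KV cache small enough for 40 GB. 32768 already goes past it.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=16384,   # same as --max-model-len 16384
)
```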
Set `--gpu-memory-utilization 0.8`.
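In the offline Python API this corresponds to the `gpu_memory_utilization` argument (a sketch; the default is 0.9):

```python
from vllm import LLM

# Reserve only 80% of each GPU's VRAM for weights + KV cache, leaving headroom.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.8,   # same as --gpu-memory-utilization 0.8
)
```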
vLLM seems to have a memory-management issue when tp>1 for AWQ models. See also #1472. After a lot of testing, I found that none of the memory-control parameters (`--max-num-batched-tokens`, `--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`) work well when tp>1: you still hit OOM once concurrency is large enough.
For now, we have found a workaround: set the swap space directly to 0. That way vLLM never touches the CPU swap space and no error is raised. The CPU block count also becomes 0, which may slow things down a bit, but at least it no longer hangs and dies.
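In the offline API the workaround maps to the `swap_space` argument (`--swap-space 0` on the CLI); a sketch, with the other values illustrative:

```python
from vllm import LLM

# Workaround: disable CPU swap space entirely so vLLM allocates no CPU blocks.
# Slightly slower in some cases, but no hang/crash.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
    swap_space=0,   # same as --swap-space 0 (default is 4 GiB per GPU)
)
```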
@bonuschild I think I have found the solution to this problem.
It appears that the issue is the `derived_max_model_len`. DeepSeek has a RoPE scaling factor of 4, which means `derived_max_model_len` becomes 65536, which is probably too big for your VRAM. If you use `--max-model-len 16384` you should be able to run this model on a single GPU with <40 GB of VRAM.
Offending code block: https://github.com/vllm-project/vllm/blob/d79ced3292445d8471b3c4e5ce2dbf311834ec1b/vllm/config.py#L591-L598
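To make the effect concrete, here is a simplified sketch of that derivation (not the actual vLLM code; field names assume the standard HF config layout):

```python
# Simplified sketch of how the effective max model length is derived from the
# model's HF config when no --max-model-len override is given.
def derived_max_model_len(hf_config: dict) -> int:
    max_len = hf_config.get("max_position_embeddings", 2048)
    rope_scaling = hf_config.get("rope_scaling")
    if rope_scaling is not None:
        # Linear RoPE scaling stretches the usable context by the factor.
        max_len = int(max_len * rope_scaling["factor"])
    return max_len

# deepseek-coder-33B-instruct: 16384 positions x RoPE factor 4 = 65536,
# the default context the KV cache must be sized for.
print(derived_max_model_len({
    "max_position_embeddings": 16384,
    "rope_scaling": {"type": "linear", "factor": 4.0},
}))  # -> 65536
```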