Loading an AWQ-quantized model causes OOM
Examples:
- https://huggingface.co/TheBloke/CodeLlama-7B-AWQ: physical size is 4 GB, but it uses about 20 GB of VRAM.
- https://huggingface.co/TheBloke/deepseek-coder-33B-instruct-AWQ: physical size is 17 GB, but it cannot run on a dual-A100 (40 GB) server even with `--tensor-parallel-size 2` configured.
Why is that? Is it related to the AWQ quantization process or to my vLLM usage?
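For reference, this is roughly how I'm loading it (a minimal sketch using the offline `LLM` API; the exact script, prompt, and sampling parameters may differ):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint across both A100s (this is where the OOM shows up).
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    dtype="half",              # AWQ kernels run in fp16
    tensor_parallel_size=2,    # same as --tensor-parallel-size 2
)

outputs = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```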
@TheBloke Could you please help me with this?
- This is normal and good - vLLM always uses nearly 100% VRAM, using the extra for caching.
- Sorry, I haven't tested AWQ with tensor parallelism on vLLM, so I have no experience there. You should definitely be able to run it on a single GPU though, even if TP doesn't work with AWQ.
- However, it doesn't just use the extra VRAM for caching; it goes beyond what's available, so OOM occurs :(
- I tried on 1 GPU:
  - Success on Phind-Codellama-34B-V2-AWQ (max seq len is 16384 by default): physical size is 18 GB, and it ends up using about 21 GB of VRAM.
  - Failure on deepseek-coder-33B-instruct-AWQ (max seq len is 65536 by default). If I limit it with `--max-model-len 16384` it runs fine (see the sketch after this list), but `--max-model-len 32768` errors again; I don't know where the edge is.

Any ideas? I only have a 2×A100 (40 GB) server.
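The working single-GPU setup, as a sketch (the kwargs mirror the CLI flags above; everything else is left at defaults):

```python
from vllm import LLM

# Single-GPU load of the 33B AWQ model; capping the context length keeps the
# KV cache small enough for 40 GB. 32768 already goes past it.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=16384,   # same as --max-model-len 16384
)
```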
Set `--gpu-memory-utilization 0.8`.
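In the offline Python API this corresponds to the `gpu_memory_utilization` argument (a sketch; the default is 0.9):

```python
from vllm import LLM

# Reserve only 80% of each GPU's VRAM for weights + KV cache, leaving headroom.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.8,   # same as --gpu-memory-utilization 0.8
)
```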
vLLM seems to have a memory-management issue when tp>1 for AWQ models. See also #1472. After a lot of testing, I found that none of the memory-control parameters (`--max-num-batched-tokens`, `--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`) work well when tp>1: you still hit OOM once concurrency is large enough.
For now, we have found a workaround: set the swap space directly to 0. That way vLLM never touches the CPU swap space and no error is raised. The CPU block count also becomes 0, which may slow things down a bit, but at least it no longer hangs and dies.
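In the offline API the workaround maps to the `swap_space` argument (`--swap-space 0` on the CLI); a sketch, with the other values illustrative:

```python
from vllm import LLM

# Workaround: disable CPU swap space entirely so vLLM allocates no CPU blocks.
# Slightly slower in some cases, but no hang/crash.
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
    swap_space=0,   # same as --swap-space 0 (default is 4 GiB per GPU)
)
```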
@bonuschild I think I have found the solution to this problem.
It appears that the issue is the `derived_max_model_len`. DeepSeek has a RoPE scaling factor of 4, which means `derived_max_model_len` becomes 65536, which is probably too big for your VRAM. If you use `--max-model-len 16384` you should be able to run this model on a single GPU with <40 GB of VRAM.
Offending code block: https://github.com/vllm-project/vllm/blob/d79ced3292445d8471b3c4e5ce2dbf311834ec1b/vllm/config.py#L591-L598
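To make the effect concrete, here is a simplified sketch of that derivation (not the actual vLLM code; field names assume the standard HF config layout):

```python
# Simplified sketch of how the effective max model length is derived from the
# model's HF config when no --max-model-len override is given.
def derived_max_model_len(hf_config: dict) -> int:
    max_len = hf_config.get("max_position_embeddings", 2048)
    rope_scaling = hf_config.get("rope_scaling")
    if rope_scaling is not None:
        # Linear RoPE scaling stretches the usable context by the factor.
        max_len = int(max_len * rope_scaling["factor"])
    return max_len

# deepseek-coder-33B-instruct: 16384 positions x RoPE factor 4 = 65536,
# the default context the KV cache must be sized for.
print(derived_max_model_len({
    "max_position_embeddings": 16384,
    "rope_scaling": {"type": "linear", "factor": 4.0},
}))  # -> 65536
```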