Why does vLLM GPTQ deployment consume so much memory?
Hi, we are working on creating a benchmark with the Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ model. We found that when deploying the GPTQ version of the model with vLLM, it consumes around 69GB of GPU memory, whereas AutoGPTQ consumes around 5.67GB. Deploying with vLLM does improve tokens/sec a lot, but the concern is that it consumes very high GPU memory.
Here's a link to our tutorial: https://tutorials.inferless.com/deploy-quantized-version-of-solar-10.7b-instruct-using-inferless
@chu-tianxiang
vLLM reserves most of the GPU's memory (90% by default) for the KV cache and uses an internal memory manager to allocate this reserved memory to individual requests.
The model weights themselves are still only 5.67GB.
@rib-2 Thanks a lot for your response! Is there a way to limit the reserved memory?
https://docs.vllm.ai/en/latest/models/engine_args.html#cmdoption-gpu-memory-utilization
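For example, a minimal sketch using the offline `LLM` API (the 0.3 value and prompt are just placeholders; tune the fraction so the weights plus enough KV cache for your workload still fit):

```python
from vllm import LLM, SamplingParams

# Cap the fraction of GPU memory vLLM reserves (default is 0.9).
# 0.3 is only an illustrative value for this sketch.
llm = LLM(
    model="Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.3,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same knob is exposed on the API server as the `--gpu-memory-utilization` flag linked above.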
@RajdeepBorgohain
I would advise against reducing this unless you are trying to fit multiple models on the same GPU or something similar. The whole point of using vLLM is to take advantage of its smart memory allocation and usage for KV caching.
This is not clear from the docs. So you're saying it would make sense for a 4-bit GPTQ Mistral 7B to take up >40GB of VRAM if available, but that's not necessarily an issue if, say, many people are running inference in parallel off a single server?