
Why does vLLM GPTQ deployment consume so much memory?

RajdeepBorgohain opened this issue 1 year ago • 4 comments

Hi, we are working on a benchmark with the Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ model. We found that deploying the GPTQ version of the model with vLLM consumes around 69 GB of GPU memory, whereas AutoGPTQ consumes around 5.67 GB. Deploying with vLLM does improve tokens/sec a lot, but the concern is the very high GPU memory consumption.

Here's a link to our tutorial: https://tutorials.inferless.com/deploy-quantized-version-of-solar-10.7b-instruct-using-inferless
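
For reference, a minimal sketch of how such a deployment might look with vLLM's offline Python API. This is an assumption for illustration: the repo id is taken from the issue text, the prompt is made up, and the actual tutorial may serve the model differently.

```python
# Hedged sketch: load the GPTQ checkpoint with vLLM's offline API.
# Assumptions: the "Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ" repo id from the
# issue text and a throwaway prompt; the real serving setup may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ",
    quantization="gptq",  # tell vLLM the weights are GPTQ-quantized
)
outputs = llm.generate(
    ["Explain GPTQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```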


RajdeepBorgohain, Jan 05 '24

@chu-tianxiang

RajdeepBorgohain, Jan 05 '24

vLLM reserves most of the GPU's memory up front for KV caching and has an internal memory manager that allocates this reserved memory to specific requests.

The model weights themselves are still 5.67 GB.
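
For concreteness, a back-of-envelope sketch of how that reservation breaks down, assuming an 80 GB GPU and vLLM's default `gpu_memory_utilization` of 0.9 (the issue does not state which GPU was used, so both numbers are assumptions):

```python
# Back-of-envelope memory accounting (assumptions: 80 GB GPU, default
# gpu_memory_utilization=0.9; the issue doesn't state the actual GPU).
total_gpu_gb = 80.0
gpu_memory_utilization = 0.9      # vLLM's default reservation fraction
weights_gb = 5.67                 # GPTQ weights, same as the AutoGPTQ footprint

reserved_gb = total_gpu_gb * gpu_memory_utilization  # ~72 GB claimed by vLLM up front
kv_cache_gb = reserved_gb - weights_gb               # ~66 GB pre-allocated as KV cache blocks
print(f"reserved ~{reserved_gb:.0f} GB, weights {weights_gb:.2f} GB, KV cache ~{kv_cache_gb:.0f} GB")
```

Under these assumptions, most of the reported usage is the pre-allocated KV cache pool, not the weights.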

robertgshaw2-redhat, Jan 05 '24

@rib-2 Thanks a lot for your response. We'd like to know if we can limit the reserved memory?

RajdeepBorgohain, Jan 06 '24

https://docs.vllm.ai/en/latest/models/engine_args.html#cmdoption-gpu-memory-utilization

@RajdeepBorgohain

I would advise against reducing this unless you are trying to fit multiple models on the same GPU or something like that. The whole point of using vLLM is to take advantage of its smart memory allocation and usage for KV caching.
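
For anyone who does need to cap the reservation (e.g. to co-locate multiple models on one GPU), a minimal sketch using the `gpu_memory_utilization` engine argument from the docs linked above; the 0.3 value is purely illustrative:

```python
# Hedged sketch: lower the fraction of GPU memory vLLM pre-allocates.
# The 0.3 value is illustrative; the equivalent server flag is
# --gpu-memory-utilization (see the engine args docs linked above).
from vllm import LLM

llm = LLM(
    model="Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.3,  # reserve ~30% of GPU memory instead of the 0.9 default
)
```

Note that shrinking this also shrinks the KV cache pool, so fewer concurrent sequences fit and throughput drops, which is exactly the trade-off described above.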

robertgshaw2-redhat, Jan 06 '24

This is not clear from the docs. So you're saying it would make sense for a 4-bit GPTQ Mistral 7B to take up >40 GB of VRAM if available, but that's not necessarily an issue if, say, many people are running inference in parallel off a single server?

freckletonj, Jan 19 '24