
Why does vLLM GPTQ deployment consume so much memory?

RajdeepBorgohain opened this issue 1 year ago • 4 comments

Hi, we are working on a benchmark with the Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ model. We found that deploying the GPTQ version of the model with vLLM consumes around 69 GB of GPU memory, whereas AutoGPTQ consumes around 5.67 GB. Deploying with vLLM does improve tokens/sec a lot, but the concern is the very high GPU memory consumption.

Here's a link to our tutorial: https://tutorials.inferless.com/deploy-quantized-version-of-solar-10.7b-instruct-using-inferless
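
For reference, a minimal sketch of how such a deployment might look with vLLM's offline Python API. This is an assumption for illustration: the repo id is taken from the issue text, the prompt is made up, and the actual tutorial may serve the model differently.

```python
# Hedged sketch: load the GPTQ checkpoint with vLLM's offline API.
# Assumptions: the "Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ" repo id from the
# issue text and a throwaway prompt; the real serving setup may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ",
    quantization="gptq",  # tell vLLM the weights are GPTQ-quantized
)
outputs = llm.generate(
    ["Explain GPTQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```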


RajdeepBorgohain, Jan 05 '24

@chu-tianxiang

RajdeepBorgohain, Jan 05 '24

vLLM reserves most of the GPU's memory up front for KV caching and has an internal memory manager that allocates this reserved memory to specific requests.

The model weights themselves are still 5.67 GB.
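
For concreteness, a back-of-envelope sketch of how that reservation breaks down, assuming an 80 GB GPU and vLLM's default `gpu_memory_utilization` of 0.9 (the issue does not state which GPU was used, so both numbers are assumptions):

```python
# Back-of-envelope memory accounting (assumptions: 80 GB GPU, default
# gpu_memory_utilization=0.9; the issue doesn't state the actual GPU).
total_gpu_gb = 80.0
gpu_memory_utilization = 0.9      # vLLM's default reservation fraction
weights_gb = 5.67                 # GPTQ weights, same as the AutoGPTQ footprint

reserved_gb = total_gpu_gb * gpu_memory_utilization  # ~72 GB claimed by vLLM up front
kv_cache_gb = reserved_gb - weights_gb               # ~66 GB pre-allocated as KV cache blocks
print(f"reserved ~{reserved_gb:.0f} GB, weights {weights_gb:.2f} GB, KV cache ~{kv_cache_gb:.0f} GB")
```

Under these assumptions, most of the reported usage is the pre-allocated KV cache pool, not the weights.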

robertgshaw2-redhat, Jan 05 '24

@rib-2 Thanks a lot for your response. We'd like to know if we can limit the reserved memory?

RajdeepBorgohain, Jan 06 '24

https://docs.vllm.ai/en/latest/models/engine_args.html#cmdoption-gpu-memory-utilization

@RajdeepBorgohain

I would advise against reducing this unless you are trying to fit multiple models on the same GPU or something like that. The whole point of using vLLM is to take advantage of its smart memory allocation and usage for KV caching.
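
For anyone who does need to cap the reservation (e.g. to co-locate multiple models on one GPU), a minimal sketch using the `gpu_memory_utilization` engine argument from the docs linked above; the 0.3 value is purely illustrative:

```python
# Hedged sketch: lower the fraction of GPU memory vLLM pre-allocates.
# The 0.3 value is illustrative; the equivalent server flag is
# --gpu-memory-utilization (see the engine args docs linked above).
from vllm import LLM

llm = LLM(
    model="Inferless/SOLAR-10.7B-Instruct-v1.0-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.3,  # reserve ~30% of GPU memory instead of the 0.9 default
)
```

Note that shrinking this also shrinks the KV cache pool, so fewer concurrent sequences fit and throughput drops, which is exactly the trade-off described above.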

robertgshaw2-redhat, Jan 06 '24

This is not clear from the docs. So you're saying it would make sense for a 4-bit GPTQ Mistral 7B to take up >40 GB of VRAM if available, but that's not necessarily an issue if, say, many people are running inference in parallel off a single server?

freckletonj, Jan 19 '24