
8bit support

Open mymusise opened this issue 1 year ago • 9 comments

Hi, will vllm support 8bit quantization? Like https://github.com/TimDettmers/bitsandbytes

In HF, we can run a 13B LLM on a 24G GPU with load_in_8bit=True.

Although PagedAttention can save about 25% of GPU memory, we still need at least 26 GB of GPU memory to deploy a 13B LLM.

In the cloud, v100-32G is more expensive than A5000-24G 😭

Is there any way to save video memory usage? 😭
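The numbers above follow from simple weight-only arithmetic. A back-of-the-envelope sketch (only the 13B parameter count and the bytes-per-parameter are inputs; activations, KV cache, and framework overhead are ignored):

```python
# Rough weight-only memory footprint of a 13B-parameter model
# at different precisions. Activations and KV cache come on top,
# which is why 8-bit weights (13 GB) can fit a 24 GB GPU but
# fp16 weights (26 GB) cannot.
PARAMS = 13e9  # 13 billion parameters

def weights_gb(bytes_per_param: float) -> float:
    """Weight memory in GB for a given precision."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weights_gb(2)  # 16-bit weights: 2 bytes each
int8_gb = weights_gb(1)  # 8-bit quantized weights: 1 byte each
print(f"fp16: {fp16_gb:.0f} GB, int8: {int8_gb:.0f} GB")
# prints "fp16: 26 GB, int8: 13 GB"
```

In HF transformers, the 8-bit path is enabled via `load_in_8bit=True` (backed by bitsandbytes), which is what makes the 13 GB figure reachable in practice.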

mymusise avatar Jun 28 '23 16:06 mymusise

same with https://github.com/vllm-project/vllm/issues/214

mymusise avatar Jun 28 '23 16:06 mymusise

Would love to see bitsandbytes integration to load models in 8- and 4-bit quantized modes.

gururise avatar Jul 07 '23 15:07 gururise

Quantization support is crucial; 8- and 4-bit support is a must.

generalsvr avatar Jul 08 '23 01:07 generalsvr

Why does running inference on vicuna-13B with fastchat-vllm take 75 GB of GPU memory (A800, 80 GB)? @mymusise

cabbagetalk avatar Jul 20 '23 06:07 cabbagetalk

Why does running inference on vicuna-13B with fastchat-vllm take 75 GB of GPU memory (A800, 80 GB)? @mymusise

@cabbagetalk you can add gpu_memory_utilization=0.4 to free your memory
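This works because vLLM pre-allocates a fixed fraction of GPU memory (for weights plus KV cache) controlled by `gpu_memory_utilization`; the 75 GB observed above is mostly that pre-allocation, not the model itself. A quick sketch of what a 0.4 setting reserves (the GPU size and model name below are illustrative):

```python
# vLLM pre-allocates gpu_memory_utilization * total GPU memory
# for model weights + KV cache; lowering it frees memory for
# other processes at the cost of a smaller KV cache.
def reserved_gb(total_gb: float, gpu_memory_utilization: float) -> float:
    """GPU memory (GB) vLLM will claim for a given utilization fraction."""
    return total_gb * gpu_memory_utilization

print(reserved_gb(80, 0.4))  # 32.0 -> ~32 GB claimed on an 80 GB A800

# Typical usage (needs a GPU, shown for illustration only):
# from vllm import LLM
# llm = LLM(model="lmsys/vicuna-13b-v1.5", gpu_memory_utilization=0.4)
```

So on an 80 GB A800, `gpu_memory_utilization=0.4` caps vLLM at roughly 32 GB instead of the ~90% default.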

mymusise avatar Jul 20 '23 07:07 mymusise

This would be especially useful for running the new meta-llama/Llama-2-70b-chat-hf models!

proceduralia avatar Jul 22 '23 12:07 proceduralia


Seeing a familiar face here!

boxter007 avatar Aug 16 '23 08:08 boxter007

Does vllm support it now?

yhyu13 avatar Dec 21 '23 15:12 yhyu13

Any fix for this issue?

louis-csm avatar Jan 08 '24 17:01 louis-csm

Hi, are there any plans to support this?

warvyvr avatar Jan 26 '24 10:01 warvyvr

any fix for integrating bitsandbytes?

qashzar avatar May 06 '24 20:05 qashzar

No fix, but the feature request is https://github.com/vllm-project/vllm/issues/4033

hmellor avatar May 20 '24 22:05 hmellor