8bit support
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes?
In HF, we can run a 13B LLM on a 24G GPU with load_in_8bit=True.
Although PagedAttention can save 25% of GPU memory, we still need at least a 26G GPU to deploy a 13B LLM.
In the cloud, a V100-32G is more expensive than an A5000-24G 😭
Is there any way to reduce GPU memory usage? 😭
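For reference, this is the HF loading path I mean (a minimal sketch; the model id is just an example, and it needs bitsandbytes and accelerate installed):

```python
# Load a ~13B causal LM with bitsandbytes int8 weights so it fits on a 24G GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # example model, not prescriptive

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights, roughly halves weight memory vs fp16
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```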
Same as https://github.com/vllm-project/vllm/issues/214.
Would love to see bitsandbytes integration to load models in 8-bit and 4-bit quantized mode.
Quantization support is crucial; 8-bit and 4-bit support is a must.
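To illustrate what that request looks like on the HF side, here is a rough 4-bit sketch (the model id and NF4 settings are only assumptions, not anything vLLM exposes today):

```python
# Load a model with bitsandbytes 4-bit (NF4) quantization via transformers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",      # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```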
Why does fastchat-vllm use 75 GB of GPU memory (A800, 80G) when I run inference with vicuna-13B? @mymusise
@cabbagetalk you can set gpu_memory_utilization=0.4 to cap how much GPU memory vLLM pre-allocates for the model and KV cache.
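Something like this, if you are using the offline LLM entry point (the model id is just an example):

```python
# Cap vLLM's GPU memory pre-allocation at ~40% instead of the ~0.9 default.
from vllm import LLM, SamplingParams

llm = LLM(
    model="lmsys/vicuna-13b-v1.5",  # example model
    gpu_memory_utilization=0.4,
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```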
This would be especially useful for running the new meta-llama/Llama-2-70b-chat-hf models!
Nice to see a familiar face here!
Does vllm support it now?
Any fix for this issue?
Hi guys, do you have plans to support it?
Any fix for integrating bitsandbytes?
No fix yet, but the feature request is tracked at https://github.com/vllm-project/vllm/issues/4033