8bit support
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes?
In HF, we can run a 13B LLM on a 24G GPU with load_in_8bit=True.
Although PagedAttention can save 25% of GPU memory, we still need at least a 26G GPU to deploy a 13B LLM.
In the cloud, a V100-32G is more expensive than an A5000-24G 😭
Is there any way to reduce GPU memory usage? 😭
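For reference, this is the HF loading path I mean (a minimal sketch; the model id is just an example, and it needs bitsandbytes and accelerate installed):

```python
# Load a ~13B causal LM with bitsandbytes int8 weights so it fits on a 24G GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # example model, not prescriptive

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights, roughly halves weight memory vs fp16
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```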
Same as https://github.com/vllm-project/vllm/issues/214.
Would love to see bitsandbytes integration to load models in 8-bit and 4-bit quantized mode.
Quantization support is crucial; 8-bit and 4-bit support is a must.
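To illustrate what that request looks like on the HF side, here is a rough 4-bit sketch (the model id and NF4 settings are only assumptions, not anything vLLM exposes today):

```python
# Load a model with bitsandbytes 4-bit (NF4) quantization via transformers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",      # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```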
Why does fastchat-vllm use 75 GB of GPU memory (A800, 80G) when I run inference with vicuna-13B? @mymusise
@cabbagetalk you can set gpu_memory_utilization=0.4 to cap how much GPU memory vLLM pre-allocates for the model and KV cache.
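Something like this, if you are using the offline LLM entry point (the model id is just an example):

```python
# Cap vLLM's GPU memory pre-allocation at ~40% instead of the ~0.9 default.
from vllm import LLM, SamplingParams

llm = LLM(
    model="lmsys/vicuna-13b-v1.5",  # example model
    gpu_memory_utilization=0.4,
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```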
This would be especially useful for running the new meta-llama/Llama-2-70b-chat-hf models!
Nice to see a familiar face here!
Does vllm support it now?
Any fix for this issue?
Hi guys, do you have plans to support it?
Any fix for integrating bitsandbytes?
No fix yet, but the feature request is tracked at https://github.com/vllm-project/vllm/issues/4033