
Feature request: support ExLlama

Open alanxmay opened this issue 1 year ago • 7 comments

ExLlama (https://github.com/turboderp/exllama)

It's currently the fastest and most memory-efficient executor of models that I'm aware of.

Is there interest from the maintainers in adding support for this?

alanxmay avatar Jun 28 '23 17:06 alanxmay

How do you plan on adding batched support for Exllama? I am very interested in your approach as I am trying to work on that too

SinanAkkoyun avatar Jul 22 '23 10:07 SinanAkkoyun

ExLlamaV2 has surpassed ExLlama in quantization performance in most cases. I hope we can get it implemented in vLLM, because it is an incredible quantization technique. Benchmarks across the major quantization techniques indicate ExLlamaV2 is the best of them. Have there been any new developments since it was added to the roadmap?

iibw avatar Dec 21 '23 19:12 iibw

Please, having ExLlamaV2 with paged attention and continuous batching would be a big win for the LLM world.

SinanAkkoyun avatar Dec 24 '23 03:12 SinanAkkoyun

Also looking forward to exllamav2 support

DaBossCoda avatar Dec 24 '23 06:12 DaBossCoda

I was hoping this would be possible, too. I recently worked with the Mixtral-8x7B model; AWQ 4-bit had significant OOM / memory overhead compared to ExLlamaV2 in 4-bit. I ended up just running the model in 8-bit using ExLlamaV2, since that turned out to be the best compromise between model capabilities and VRAM footprint. I can run it in 8-bit on 3x3090 and use the full 32k context with ExLlamaV2, but I need 4x3090 to even be able to load it in 16-bit within vLLM, and I reach OOM when I try to use the full context.

So this would definitely be an amazing addition to have more flexibility in terms of VRAM-Resources.

RuntimeRacer avatar Jan 01 '24 16:01 RuntimeRacer

+1

theobjectivedad avatar Jan 04 '24 04:01 theobjectivedad

+1

tolecy avatar Jan 05 '24 04:01 tolecy

+1

chricro avatar Feb 23 '24 20:02 chricro

+1

agahEbrahimi avatar Mar 01 '24 05:03 agahEbrahimi

+1

a-creation avatar Mar 04 '24 02:03 a-creation

Supporting ExLlamaV2 would be the biggest release yet for vLLM. +1

rjmehta1993 avatar Apr 04 '24 15:04 rjmehta1993

+1

sapountzis avatar Apr 30 '24 20:04 sapountzis

+1

kulievvitaly avatar Jun 25 '24 20:06 kulievvitaly