
ExLlamaV2 inference with EXL quants

Open · rjmehta1993 opened this issue 1 year ago • 1 comment

Do you support the ExLlamaV2 backend for inference, which supports EXL quants?

The current alternative is vLLM, but it doesn't support EXL quants. Also, after running a perplexity test, EXL came out best.

Transformers supports an ExLlamaV2 backend, but its tokens/sec throughput is very poor.

rjmehta1993 · Jul 26 '24 04:07

I believe this can be deployed through a custom Python-based backend, in a similar way to how we have a vLLM backend: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
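
For reference, here is a minimal sketch of what such a Python-based backend's `model.py` might look like, following the structure of the linked vllm_backend example. The tensor names (`text_input`, `text_output`), the model directory layout, and the fixed generation settings are illustrative assumptions, not an official interface; the ExLlamaV2 calls follow the library's basic generator example and may need adjusting for your exllamav2 version. This is untested.

```python
# model.py -- hypothetical Triton Python-backend wrapper around ExLlamaV2.
import os

import numpy as np
import triton_python_backend_utils as pb_utils  # provided by Triton's Python backend

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class TritonPythonModel:
    def initialize(self, args):
        # Assumed layout: the EXL2-quantized model lives in an "exl2_model"
        # directory inside the model version folder.
        model_dir = os.path.join(
            args["model_repository"], args["model_version"], "exl2_model"
        )
        config = ExLlamaV2Config(model_dir)
        self.model = ExLlamaV2(config)
        # Lazy cache + autosplit loading, as in the exllamav2 examples.
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)
        self.settings = ExLlamaV2Sampler.Settings()  # default sampling settings

    def execute(self, requests):
        responses = []
        for request in requests:
            # "text_input" is an assumed BYTES input tensor; match it to config.pbtxt.
            prompt = (
                pb_utils.get_input_tensor_by_name(request, "text_input")
                .as_numpy()
                .flatten()[0]
                .decode("utf-8")
            )
            # Generate up to 256 new tokens (arbitrary illustrative cap).
            output = self.generator.generate_simple(prompt, self.settings, 256)
            out_tensor = pb_utils.Tensor(
                "text_output", np.array([output.encode("utf-8")], dtype=object)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```

A matching `config.pbtxt` would declare `text_input` and `text_output` as `TYPE_STRING` tensors; streaming and request batching would need more work (e.g. a decoupled model and ExLlamaV2's dynamic generator), which the vllm_backend example also illustrates.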

oandreeva-nv · Jul 31 '24 22:07