Exllamav2 inference with EXL Quants
Do you support an ExLlamaV2 backend for inference, so that EXL quants can be served?
The current alternative is vLLM, but it doesn't support EXL quants. Also, after running perplexity tests, EXL quants came out best.
Transformers supports an ExLlamaV2 backend, but its tokens/sec throughput is very poor.
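For context, the Transformers route I mean is (as far as I understand) the exllamav2 kernels exposed through GPTQConfig, which only covers GPTQ weights rather than EXL2 quants. Roughly, with a placeholder model id:

```python
# Transformers with the exllamav2 kernels (GPTQ weights only, not EXL2 quants).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "some-org/some-model-GPTQ"  # placeholder

# version=2 selects the exllamav2 kernels for dequantization.
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```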
I believe this could be deployed through a custom Python-based backend, similar to the existing vLLM backend: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
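To illustrate, here is a minimal sketch of what such a Python-backend model.py could look like. It is only an illustration under my assumptions about the exllamav2 API (ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2BaseGenerator, etc.); the tensor names ("text_input", "text_output"), the "exl2_model" directory, and the sampling settings are hypothetical, not a working backend:

```python
# model.py -- sketch of a Triton Python backend wrapping ExLlamaV2.
# Assumes the exllamav2 package and its ExLlamaV2BaseGenerator API.
# Tensor names ("text_input", "text_output") and the "exl2_model"
# subdirectory of the model repository are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class TritonPythonModel:
    def initialize(self, args):
        # Load the EXL2-quantized model from the model repository directory.
        config = ExLlamaV2Config()
        config.model_dir = f"{args['model_repository']}/{args['model_version']}/exl2_model"
        config.prepare()

        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)

        self.tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)

        # Example sampling defaults; these would come from the request in practice.
        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.7
        self.settings.top_p = 0.9

    def execute(self, requests):
        responses = []
        for request in requests:
            # One prompt per request; batching and streaming omitted for brevity.
            prompt = (
                pb_utils.get_input_tensor_by_name(request, "text_input")
                .as_numpy()[0]
                .decode("utf-8")
            )

            output = self.generator.generate_simple(prompt, self.settings, 256)

            out_tensor = pb_utils.Tensor(
                "text_output", np.array([output.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```

The config.pbtxt would declare TYPE_STRING input/output tensors, and token streaming would presumably need Triton's decoupled mode, the same way the vLLM backend handles it.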