emvw7yf

3 issues

Awesome project, thanks! Does it support sharding large models across multiple GPUs, or would that be in scope for this project in the future?

feature request
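For context on what "sharding" means here, a common way to do it today is Hugging Face Accelerate's `device_map` support in `transformers`. The sketch below is illustrative of that approach, not of this project's API, and the checkpoint name is a placeholder:

```python
# Illustrative only: layer-wise sharding via Hugging Face Accelerate's
# device_map (not this project's API). "auto" spreads the model's layers
# across all visible GPUs, so checkpoints too large for one card still load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard layers across available GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```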

Based on experimenting with GPTQ-for-LLaMa, int4 quantization seems to introduce a 3-5% increase in perplexity, while int8 is almost identical to fp16. Would it be possible to use int8 quantization with...

feature request
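To make the int8-vs-fp16 claim concrete, here is a toy round-trip (not GPTQ itself): symmetric per-row absmax int8 quantization of a random weight matrix, where the dequantization error is typically well under 1% of the weights' magnitude:

```python
# Toy illustration (not GPTQ): symmetric per-row absmax int8 quantization
# of a weight matrix, showing the round-trip reconstruction error is small.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)

scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale                  # dequantize

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.4%}")          # typically < 1%
```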

TL;DR: the patch below makes multi-GPU inference 5x faster. I noticed that text-generation is significantly slower on multi-GPU vs. single-GPU. Some results (using llama models and utilizing the full 2048...
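The patch itself is cut off in the snippet above, but the slowdown it targets can be measured with a harness along these lines. This is a hypothetical sketch (placeholder checkpoint name, `transformers`-based) that times greedy decoding with the whole model on one GPU versus sharded across all of them:

```python
# Hypothetical timing harness for the single- vs multi-GPU comparison:
# measures tokens/second for greedy decoding. The checkpoint name is a
# placeholder; device_map controls single- vs multi-GPU placement.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(device_map, model_id="huggyllama/llama-7b", n_new=128):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map=device_map, torch_dtype=torch.float16
    )
    inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
    torch.cuda.synchronize()
    return n_new / (time.perf_counter() - start)

print("single GPU:", tokens_per_second({"": 0}))  # whole model on cuda:0
print("multi GPU: ", tokens_per_second("auto"))   # layers sharded across GPUs
```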