emvw7yf

3 issues

Awesome project, thanks! Does it support sharding large models across multiple GPUs, or would that be in scope for this project in the future?

feature request
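For context on what "sharding" means here, a common way to do it today is Hugging Face Accelerate's `device_map` support in `transformers`. The sketch below is illustrative of that approach, not of this project's API, and the checkpoint name is a placeholder:

```python
# Illustrative only: layer-wise sharding via Hugging Face Accelerate's
# device_map (not this project's API). "auto" spreads the model's layers
# across all visible GPUs, so checkpoints too large for one card still load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard layers across available GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```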

Based on experimenting with GPTQ-for-LLaMa, int4 quantization seems to introduce a 3-5% increase in perplexity, while int8 is almost identical to fp16. Would it be possible to use int8 quantization with...

feature request
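To make the int8-vs-fp16 claim concrete, here is a toy round-trip (not GPTQ itself): symmetric per-row absmax int8 quantization of a random weight matrix, where the dequantization error is typically well under 1% of the weights' magnitude:

```python
# Toy illustration (not GPTQ): symmetric per-row absmax int8 quantization
# of a weight matrix, showing the round-trip reconstruction error is small.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)

scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale                  # dequantize

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.4%}")          # typically < 1%
```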

TL;DR: the patch below makes multi-GPU inference 5x faster. I noticed that text-generation is significantly slower on multi-GPU vs. single-GPU. Some results (using llama models and utilizing the full 2048...
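The patch itself is cut off in the snippet above, but the slowdown it targets can be measured with a harness along these lines. This is a hypothetical sketch (placeholder checkpoint name, `transformers`-based) that times greedy decoding with the whole model on one GPU versus sharded across all of them:

```python
# Hypothetical timing harness for the single- vs multi-GPU comparison:
# measures tokens/second for greedy decoding. The checkpoint name is a
# placeholder; device_map controls single- vs multi-GPU placement.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(device_map, model_id="huggyllama/llama-7b", n_new=128):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map=device_map, torch_dtype=torch.float16
    )
    inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
    torch.cuda.synchronize()
    return n_new / (time.perf_counter() - start)

print("single GPU:", tokens_per_second({"": 0}))  # whole model on cuda:0
print("multi GPU: ", tokens_per_second("auto"))   # layers sharded across GPUs
```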