exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

99 exllama issues, sorted by recently updated

I'm using ExLlama with the Oobabooga text-generation UI and the model TheBloke_llama2_70b_chat_uncensored-GPTQ. The model works great, but with ExLlama as the loader the model talks to itself, generating its own...
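A common cause of the model "talking to itself" is the absence of a stop condition, so generation runs straight past the end of the assistant's turn. Below is a minimal sketch of truncating at a stop string after generation, based on the API in exllama's own examples (`model.py` / `tokenizer.py` / `generator.py`); the model paths and the stop string for this chat format are assumptions:

```python
# Minimal sketch: cut generation at a stop string so the model
# does not continue as the other speaker. Module layout and
# generate_simple follow exllama's example scripts; paths are
# hypothetical.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("models/llama2-70b-gptq/config.json")
config.model_path = "models/llama2-70b-gptq/model.safetensors"
model = ExLlama(config)
tokenizer = ExLlamaTokenizer("models/llama2-70b-gptq/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

STOP = "\n### Human:"   # assumed stop string for this prompt format

prompt = "### Human: Hello!\n### Assistant:"
text = generator.generate_simple(prompt, max_new_tokens=200)

# Keep only the assistant's turn: drop everything from the stop
# string onward.
reply = text[len(prompt):]
if STOP in reply:
    reply = reply[:reply.index(STOP)]
print(reply.strip())
```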

So, I'm trying to do batch generation using code from oobabooga's text-generation-webui, calling the generate method of ExllamaHF, but an error was thrown. I guess because ExLlama...
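For comparison, here is a minimal sketch of batched generation against exllama directly, reusing the `generator` from the setup sketch above. It assumes a version of exllama where `generate_simple` accepts a list of prompts and returns a matching list of completions; if yours does not, the inputs have to be padded and stacked manually:

```python
# Sketch: batched generation, reusing `generator` from the setup
# above. Assumes generate_simple accepts a list of prompts and
# returns a list of completions (true in some exllama versions).
prompts = [
    "The capital of France is",
    "Llamas are",
    "Write a haiku about GPUs:",
]
outputs = generator.generate_simple(prompts, max_new_tokens=64)
for p, o in zip(prompts, outputs):
    print(o[len(p):].strip())
```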

As in the title. The web app is very nice, simple, and clean. It works without any fuss and doesn't have any of the VRAM or other overhead of...

Thanks for the wonderful repo, @turboderp! I'm benchmarking latency on an A100, and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to...
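For what it's worth, a small probe like the sketch below is one way to measure forward-pass latency across batch sizes. The `forward` callable is a placeholder for a full model forward, not exllama API, and the CUDA synchronizations are what keep the timings honest:

```python
# Sketch: per-batch forward latency probe. `forward` is a
# placeholder callable (input_ids -> logits), not exllama API.
import time
import torch

def time_batch(forward, batch_size, seq_len=128, iters=10, vocab=32000):
    ids = torch.randint(0, vocab, (batch_size, seq_len), device="cuda")
    forward(ids)                       # warm-up pass, excluded from timing
    torch.cuda.synchronize()           # drain pending kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        forward(ids)
    torch.cuda.synchronize()           # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters

for bs in (1, 2, 4, 8, 16):
    # forward_fn: your model's forward pass (hypothetical name)
    print(f"batch {bs}: {time_batch(forward_fn, bs) * 1000:.1f} ms")
```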

Hello. I noticed a couple of recent PRs added the [encode_special_characters parameter to the tokenizer](https://github.com/turboderp/exllama/blob/master/tokenizer.py#L25). This is great, because right now I don't think exllama by default encodes special...
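A sketch of what the flag changes at the call site. Only `encode_special_characters` comes from the linked PR; the constructor path and the idea that special tags map to special token ids rather than literal text are assumptions about the tokenizer API:

```python
# Sketch: encoding special tokens literally vs. as special token
# ids. encode_special_characters is from the linked PR; the rest
# of this call pattern is assumed.
from tokenizer import ExLlamaTokenizer

tokenizer = ExLlamaTokenizer("models/llama2/tokenizer.model")

text = "<s>Hello</s>"
plain = tokenizer.encode(text)                                    # tags treated as plain text
special = tokenizer.encode(text, encode_special_characters=True)  # tags mapped to special ids
print(plain)
print(special)
```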

Hello, I am running a 2x 4090 PC on Windows, with exllama on 7B Llama 2. I am only getting ~70-75 t/s during inference (using just one 4090), but based on the charts,...

vLLM and HF's TGI can do this. Additional context: https://github.com/turboderp/exllama/issues/150#issuecomment-1633417028

I found an example of using Flask for API requests. I gave it a try, but when making concurrent requests, the generated responses from inference come back as garbled text...
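Garbled output under concurrency usually means multiple requests are driving the same generator and KV cache at once. A minimal sketch of one fix, serializing requests with a lock (it assumes a single shared `ExLlamaGenerator` named `generator`, as in the setup sketch above; true batching across requests is the better long-term answer):

```python
# Sketch: serialize access to one shared generator so concurrent
# Flask requests cannot interleave its cache state. `generator`
# is the shared ExLlamaGenerator instance from the setup above.
import threading
from flask import Flask, request

app = Flask(__name__)
gen_lock = threading.Lock()

@app.route("/infer", methods=["POST"])
def infer():
    prompt = request.json["prompt"]
    with gen_lock:  # one generation at a time
        text = generator.generate_simple(prompt, max_new_tokens=200)
    return {"text": text}
```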

Is there a plan to include support for the NF4 data type from the QLoRA paper?
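For context, NF4 as described in the QLoRA paper is a 4-bit code whose 16 levels sit at quantiles of a standard normal distribution, normalized to [-1, 1], applied to absmax-scaled weight blocks. A rough sketch of constructing such levels follows; it mirrors the paper's description but not its exact construction (the real NF4 uses an asymmetric split so that one code is exactly zero), so these are illustrative values, not bitsandbytes' constants:

```python
# Rough sketch of NF4-style quantization levels per the QLoRA
# paper: 16 codes at quantiles of N(0, 1), normalized to [-1, 1].
# Illustrative only; not the exact NF4 constants.
import numpy as np
from scipy.stats import norm

def nf4_like_levels(k=16):
    p = (np.arange(k) + 0.5) / k        # probabilities away from 0 and 1
    q = norm.ppf(p)                     # normal quantiles
    return q / np.abs(q).max()          # normalize extremes to +/-1

def quantize(x, levels):
    # Map each value to the nearest code after absmax scaling.
    scale = np.abs(x).max()
    idx = np.abs(x[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

levels = nf4_like_levels()
print(np.round(levels, 4))
```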

https://github.com/dust-tt/llama-ssp Any plans to implement speculative decoding? It would probably improve latency by at least 2x, and it seems not too difficult to implement.
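For readers unfamiliar with the idea: a small draft model proposes several tokens cheaply, and the large target model verifies them all in a single forward pass, keeping the agreed prefix plus one free token. Below is a minimal greedy sketch of that loop; the full method uses rejection sampling over probabilities, and both `draft_logits` and `target_logits` here are assumed callables mapping a 1-D id tensor to per-position next-token logits, not exllama API:

```python
# Minimal greedy speculative decoding sketch. draft_logits and
# target_logits: callables, ids (1-D LongTensor) -> (seq, vocab)
# logits. Real implementations use rejection sampling; this only
# shows the propose/verify/accept loop.
import torch

def speculative_greedy(draft_logits, target_logits, ids, k=4, max_new=64):
    ids = ids.clone()
    remaining = max_new
    while remaining > 0:
        base = ids.numel()
        # 1. Cheap draft model proposes k tokens autoregressively.
        prop = ids
        for _ in range(k):
            nxt = draft_logits(prop)[-1].argmax().view(1)
            prop = torch.cat([prop, nxt])
        # 2. Expensive target model scores all k proposals in ONE
        #    forward pass. tgt[i] is its prediction after prop[:i+1].
        tgt = target_logits(prop).argmax(dim=-1)
        # 3. Accept the longest prefix on which both models agree.
        n_accept = 0
        while n_accept < k and prop[base + n_accept] == tgt[base + n_accept - 1]:
            n_accept += 1
        # 4. Keep accepted tokens plus one free token from the target
        #    (its prediction after the last accepted position).
        bonus = tgt[base + n_accept - 1].view(1)
        ids = torch.cat([ids, prop[base:base + n_accept], bonus])
        remaining -= n_accept + 1
    return ids
```

Even when every draft token is rejected, each iteration still emits one target-model token, so the loop never does worse than one token per target forward pass; agreement between the models is what buys the speedup.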