exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

99 exllama issues, sorted by recently updated

I'm using ExLlama with the Oobabooga text-generation UI and the model TheBloke_llama2_70b_chat_uncensored-GPTQ. The model works great, but with ExLlama as the loader the model talks to itself, generating its own...
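A common cause of the model "talking to itself" is the absence of a stop condition, so generation runs straight past the end of the assistant's turn. Below is a minimal sketch of truncating at a stop string after generation, based on the API in exllama's own examples (`model.py` / `tokenizer.py` / `generator.py`); the model paths and the stop string for this chat format are assumptions:

```python
# Minimal sketch: cut generation at a stop string so the model
# does not continue as the other speaker. Module layout and
# generate_simple follow exllama's example scripts; paths are
# hypothetical.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("models/llama2-70b-gptq/config.json")
config.model_path = "models/llama2-70b-gptq/model.safetensors"
model = ExLlama(config)
tokenizer = ExLlamaTokenizer("models/llama2-70b-gptq/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

STOP = "\n### Human:"   # assumed stop string for this prompt format

prompt = "### Human: Hello!\n### Assistant:"
text = generator.generate_simple(prompt, max_new_tokens=200)

# Keep only the assistant's turn: drop everything from the stop
# string onward.
reply = text[len(prompt):]
if STOP in reply:
    reply = reply[:reply.index(STOP)]
print(reply.strip())
```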

So, I'm trying to do batch generation using code from oobabooga's text-generation-webui, calling the generate method of ExllamaHF, but an error was thrown. I guess because ExLlama...
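For comparison, here is a minimal sketch of batched generation against exllama directly, reusing the `generator` from the setup sketch above. It assumes a version of exllama where `generate_simple` accepts a list of prompts and returns a matching list of completions; if yours does not, the inputs have to be padded and stacked manually:

```python
# Sketch: batched generation, reusing `generator` from the setup
# above. Assumes generate_simple accepts a list of prompts and
# returns a list of completions (true in some exllama versions).
prompts = [
    "The capital of France is",
    "Llamas are",
    "Write a haiku about GPUs:",
]
outputs = generator.generate_simple(prompts, max_new_tokens=64)
for p, o in zip(prompts, outputs):
    print(o[len(p):].strip())
```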

As in the title. The web app is very nice, simple, and clean. It works without any fuss and doesn't have any of the VRAM or other overhead of...

Thanks for the wonderful repo, @turboderp! I'm benchmarking latency on an A100, and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to...
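For what it's worth, a small probe like the sketch below is one way to measure forward-pass latency across batch sizes. The `forward` callable is a placeholder for a full model forward, not exllama API, and the CUDA synchronizations are what keep the timings honest:

```python
# Sketch: per-batch forward latency probe. `forward` is a
# placeholder callable (input_ids -> logits), not exllama API.
import time
import torch

def time_batch(forward, batch_size, seq_len=128, iters=10, vocab=32000):
    ids = torch.randint(0, vocab, (batch_size, seq_len), device="cuda")
    forward(ids)                       # warm-up pass, excluded from timing
    torch.cuda.synchronize()           # drain pending kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        forward(ids)
    torch.cuda.synchronize()           # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters

for bs in (1, 2, 4, 8, 16):
    # forward_fn: your model's forward pass (hypothetical name)
    print(f"batch {bs}: {time_batch(forward_fn, bs) * 1000:.1f} ms")
```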

Hello. I noticed a couple of recent PRs added the [encode_special_characters parameter to the tokenizer](https://github.com/turboderp/exllama/blob/master/tokenizer.py#L25). This is great, because right now I don't think exllama by default encodes special...
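A sketch of what the flag changes at the call site. Only `encode_special_characters` comes from the linked PR; the constructor path and the idea that special tags map to special token ids rather than literal text are assumptions about the tokenizer API:

```python
# Sketch: encoding special tokens literally vs. as special token
# ids. encode_special_characters is from the linked PR; the rest
# of this call pattern is assumed.
from tokenizer import ExLlamaTokenizer

tokenizer = ExLlamaTokenizer("models/llama2/tokenizer.model")

text = "<s>Hello</s>"
plain = tokenizer.encode(text)                                    # tags treated as plain text
special = tokenizer.encode(text, encode_special_characters=True)  # tags mapped to special ids
print(plain)
print(special)
```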

Hello, I am running a 2x 4090 PC on Windows, with exllama on 7B Llama 2. I am only getting ~70-75 t/s during inference (using just one 4090), but based on the charts,...

vLLM and HF's TGI can do this. Additional context: https://github.com/turboderp/exllama/issues/150#issuecomment-1633417028

I found an example of using Flask for API requests. I gave it a try, but when making concurrent requests, the generated responses from inference come back as garbled text...
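Garbled output under concurrency usually means multiple requests are driving the same generator and KV cache at once. A minimal sketch of one fix, serializing requests with a lock (it assumes a single shared `ExLlamaGenerator` named `generator`, as in the setup sketch above; true batching across requests is the better long-term answer):

```python
# Sketch: serialize access to one shared generator so concurrent
# Flask requests cannot interleave its cache state. `generator`
# is the shared ExLlamaGenerator instance from the setup above.
import threading
from flask import Flask, request

app = Flask(__name__)
gen_lock = threading.Lock()

@app.route("/infer", methods=["POST"])
def infer():
    prompt = request.json["prompt"]
    with gen_lock:  # one generation at a time
        text = generator.generate_simple(prompt, max_new_tokens=200)
    return {"text": text}
```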

Is there a plan to include support for the NF4 data type from the QLoRA paper?
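For context, NF4 as described in the QLoRA paper is a 4-bit code whose 16 levels sit at quantiles of a standard normal distribution, normalized to [-1, 1], applied to absmax-scaled weight blocks. A rough sketch of constructing such levels follows; it mirrors the paper's description but not its exact construction (the real NF4 uses an asymmetric split so that one code is exactly zero), so these are illustrative values, not bitsandbytes' constants:

```python
# Rough sketch of NF4-style quantization levels per the QLoRA
# paper: 16 codes at quantiles of N(0, 1), normalized to [-1, 1].
# Illustrative only; not the exact NF4 constants.
import numpy as np
from scipy.stats import norm

def nf4_like_levels(k=16):
    p = (np.arange(k) + 0.5) / k        # probabilities away from 0 and 1
    q = norm.ppf(p)                     # normal quantiles
    return q / np.abs(q).max()          # normalize extremes to +/-1

def quantize(x, levels):
    # Map each value to the nearest code after absmax scaling.
    scale = np.abs(x).max()
    idx = np.abs(x[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

levels = nf4_like_levels()
print(np.round(levels, 4))
```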

https://github.com/dust-tt/llama-ssp Any plans to implement speculative decoding? It would probably improve latency by at least 2x, and it seems not too difficult to implement.
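For readers unfamiliar with the idea: a small draft model proposes several tokens cheaply, and the large target model verifies them all in a single forward pass, keeping the agreed prefix plus one free token. Below is a minimal greedy sketch of that loop; the full method uses rejection sampling over probabilities, and both `draft_logits` and `target_logits` here are assumed callables mapping a 1-D id tensor to per-position next-token logits, not exllama API:

```python
# Minimal greedy speculative decoding sketch. draft_logits and
# target_logits: callables, ids (1-D LongTensor) -> (seq, vocab)
# logits. Real implementations use rejection sampling; this only
# shows the propose/verify/accept loop.
import torch

def speculative_greedy(draft_logits, target_logits, ids, k=4, max_new=64):
    ids = ids.clone()
    remaining = max_new
    while remaining > 0:
        base = ids.numel()
        # 1. Cheap draft model proposes k tokens autoregressively.
        prop = ids
        for _ in range(k):
            nxt = draft_logits(prop)[-1].argmax().view(1)
            prop = torch.cat([prop, nxt])
        # 2. Expensive target model scores all k proposals in ONE
        #    forward pass. tgt[i] is its prediction after prop[:i+1].
        tgt = target_logits(prop).argmax(dim=-1)
        # 3. Accept the longest prefix on which both models agree.
        n_accept = 0
        while n_accept < k and prop[base + n_accept] == tgt[base + n_accept - 1]:
            n_accept += 1
        # 4. Keep accepted tokens plus one free token from the target
        #    (its prediction after the last accepted position).
        bonus = tgt[base + n_accept - 1].view(1)
        ids = torch.cat([ids, prop[base:base + n_accept], bonus])
        remaining -= n_accept + 1
    return ids
```

Even when every draft token is rejected, each iteration still emits one target-model token, so the loop never does worse than one token per target forward pass; agreement between the models is what buys the speedup.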