
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

99 exllama issues, sorted by recently updated

Hi! I got this to work with [TheBloke/WizardLM-30B-Uncensored-GPTQ](https://huggingface.co/TheBloke/WizardLM-30B-Uncensored-GPTQ). Here's what worked: 1. This doesn't work on windows, but it does work on WSL 2. Download the model (and all files)...

https://github.com/SqueezeAILab/SqueezeLLM — is this something exllama will support out of the box? What would integrating support look like?

We are trying to port our transformers-based generation code to exllama but did not find a configurable `length_penalty` control. Will this be on the roadmap? Thanks.
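For context, HF transformers applies `length_penalty` during beam search by dividing a hypothesis's summed log-probabilities by its length raised to the penalty. A minimal sketch of that normalization (the function name is illustrative, not an exllama API):

```python
def apply_length_penalty(sum_logprobs: float, length: int,
                         length_penalty: float = 1.0) -> float:
    """Normalize a beam hypothesis score the way HF transformers does.

    length_penalty > 1.0 favors longer sequences, < 1.0 favors shorter
    ones, and 0.0 disables length normalization entirely.
    """
    return sum_logprobs / (length ** length_penalty)

# Example: a 5-token hypothesis with total log-prob -10.0
score = apply_length_penalty(-10.0, 5, length_penalty=1.0)  # -2.0
```

With `length_penalty=1.0` this is plain per-token averaging; exposing the exponent is what makes the knob useful for steering output length.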

I ran a test on the latest commit (77545c) and on bec6c9, on an H100 with a 30B model, and I can see a consistent performance degradation: ``` Latest: 25 t/s, bec6c9: 34 t/s ```...

Opening a new thread to continue the conversation re: the API, as I think a dedicated discussion thread will be valuable as the project continues to scale. Continuation from:...

I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference,...

First of all, this is a terrific project. I've been trying to integrate it with other apps, but the API is a little different compared to other implementations like [KoboldAI](https://github.com/KoboldAI/KoboldAI-Client) and...

According to this post, this is a method of RoPE scaling that results in less perplexity loss and allows larger scaling factors: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ The code can be found in this...
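The core idea in that post is small: instead of linearly interpolating positions, NTK-aware scaling rescales the rotary base so low frequencies stretch more than high ones. A minimal sketch, assuming the usual HF-style inverse-frequency computation (the function name and the `alpha` parameter name are illustrative):

```python
def ntk_rope_inv_freq(dim: int, alpha: float, base: float = 10000.0):
    """RoPE inverse frequencies with NTK-aware scaling.

    Standard RoPE uses inv_freq[i] = 1 / base**(2i/dim). The NTK-aware
    variant from the post replaces the base with
    base * alpha**(dim / (dim - 2)), which leaves the highest
    frequencies nearly untouched while stretching the low ones.
    """
    scaled_base = base * alpha ** (dim / (dim - 2))
    return [1.0 / scaled_base ** (2 * i / dim) for i in range(dim // 2)]

# alpha=1.0 recovers standard RoPE; alpha=4.0 extends the usable context.
standard = ntk_rope_inv_freq(128, 1.0)
scaled = ntk_rope_inv_freq(128, 4.0)
```

Because only the base changes, this drops into any existing RoPE implementation as a one-line modification before the frequency table is built.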

I'm kind of a newbie and this is probably not the right place to ask, but maybe I can get pointed in the right direction. I have a FastAPI server and...

This adds support for the new NTK RoPE scaling, mentioned in https://github.com/turboderp/exllama/issues/115: "According to this post, this is a method of RoPE scaling that results in less perplexity loss and...