
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

99 exllama issues, sorted by recently updated

Hi! I got this to work with [TheBloke/WizardLM-30B-Uncensored-GPTQ](https://huggingface.co/TheBloke/WizardLM-30B-Uncensored-GPTQ). Here's what worked: 1. This doesn't work on windows, but it does work on WSL 2. Download the model (and all files)...

https://github.com/SqueezeAILab/SqueezeLLM — is this something exllama will support out of the box? What would integrating support look like?

We are trying to port our transformers-based generation code to exllama but did not find a configurable `length_penalty` control. Will this be on the roadmap? Thanks.
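For context, HF transformers applies `length_penalty` during beam search by dividing a hypothesis's summed log-probabilities by its length raised to the penalty. A minimal sketch of that normalization (the function name is illustrative, not an exllama API):

```python
def apply_length_penalty(sum_logprobs: float, length: int,
                         length_penalty: float = 1.0) -> float:
    """Normalize a beam hypothesis score the way HF transformers does.

    length_penalty > 1.0 favors longer sequences, < 1.0 favors shorter
    ones, and 0.0 disables length normalization entirely.
    """
    return sum_logprobs / (length ** length_penalty)

# Example: a 5-token hypothesis with total log-prob -10.0
score = apply_length_penalty(-10.0, 5, length_penalty=1.0)  # -2.0
```

With `length_penalty=1.0` this is plain per-token averaging; exposing the exponent is what makes the knob useful for steering output length.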

I ran a test on the latest commit (77545c) and on bec6c9, on an H100 with a 30B model, and I can see a consistent performance degradation: ``` Latest: 25 t/s, bec6c9: 34 t/s ```...

Opening a new thread to continue the conversation re: the API, as I think a dedicated discussion thread will be valuable as the project continues to scale. Continuation from:...

I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference,...

First of all, this is a terrific project. I've been trying to integrate it with other apps, but the API is a little different compared to other implementations like [KoboldAI](https://github.com/KoboldAI/KoboldAI-Client) and...

According to this post, this is a method of RoPE scaling that results in less perplexity loss and allows larger scaling factors: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ The code can be found in this...
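The core idea in that post is small: instead of linearly interpolating positions, NTK-aware scaling rescales the rotary base so low frequencies stretch more than high ones. A minimal sketch, assuming the usual HF-style inverse-frequency computation (the function name and the `alpha` parameter name are illustrative):

```python
def ntk_rope_inv_freq(dim: int, alpha: float, base: float = 10000.0):
    """RoPE inverse frequencies with NTK-aware scaling.

    Standard RoPE uses inv_freq[i] = 1 / base**(2i/dim). The NTK-aware
    variant from the post replaces the base with
    base * alpha**(dim / (dim - 2)), which leaves the highest
    frequencies nearly untouched while stretching the low ones.
    """
    scaled_base = base * alpha ** (dim / (dim - 2))
    return [1.0 / scaled_base ** (2 * i / dim) for i in range(dim // 2)]

# alpha=1.0 recovers standard RoPE; alpha=4.0 extends the usable context.
standard = ntk_rope_inv_freq(128, 1.0)
scaled = ntk_rope_inv_freq(128, 4.0)
```

Because only the base changes, this drops into any existing RoPE implementation as a one-line modification before the frequency table is built.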

I'm kind of a newbie and this is probably not the right place to ask, but maybe I can get pointed in the right direction. I have a FastAPI server and...

This adds support for the new NTK RoPE scaling, mentioned in https://github.com/turboderp/exllama/issues/115: "According to this post, this is a method of RoPE scaling that results in less perplexity loss and...