exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
Hi, sorry if this is obvious :) but I'm trying to build the Docker container. It says to "First, set the `MODEL_PATH` and `SESSIONS_PATH` variables in the `.env` file...
This line in `generator.py` yields infinite logits when temperature is set to 0: https://github.com/turboderp/exllama/blob/c16cf49c3f19e887da31d671a713619c8626484e/generator.py#L106C1-L106C30  Debugger result: 
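A minimal sketch of the failure mode and one possible guard; the function and variable names below are illustrative, not the actual `generator.py` code:

```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Dividing by temperature == 0 produces infinite logits, which then break
    # softmax / sampling downstream. Clamping to a small epsilon (or treating
    # 0 as greedy argmax decoding) avoids the inf values.
    if temperature <= 0.0:
        temperature = 1e-8  # alternatively, fall back to greedy argmax decoding
    return logits / temperature

# temperature = 0 no longer yields inf:
print(apply_temperature(torch.tensor([2.0, -1.0, 0.5]), 0.0))
```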
Hi, I was prompting llama-2-7B and ran into this error. Can you please handle the case where there are NaNs in the logits?
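One hedged sketch of how NaN logits could be handled before sampling; this is not the repository's code, just an assumption of one reasonable guard:

```python
import torch

def sanitize_logits(logits: torch.Tensor) -> torch.Tensor:
    # If the forward pass produces NaNs (e.g. an fp16 overflow), multinomial
    # sampling fails. Mapping NaNs to -inf gives those tokens zero probability
    # instead of crashing the generator.
    if torch.isnan(logits).any():
        logits = torch.nan_to_num(logits, nan=float("-inf"))
    return logits
```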
I am using exllama through the oobabooga text-generation-webui with AMD/ROCm. I cloned exllama into the text-generation-webui/repositories folder and installed dependencies. Devices: 2x AMD Instinct MI60 gfx906 Distro: Ubuntu 20.04.6 Kernel:...
I couldn't reopen my original issue, so I hope it's fine if I open another bug. The Pascal fix is broken again, at least for me. The following check does...
I am getting poor-quality results with prompts longer than 2048 tokens when using a LoRA trained with alpaca_lora_4bit. These are the settings I am using: ``` config = ExLlamaConfig(model_config_path) #...
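For context, a minimal sketch of a config for longer-than-2048-token prompts; the paths are hypothetical and `max_seq_len` / `compress_pos_emb` are assumed config attributes rather than settings confirmed by this issue:

```python
from model import ExLlama, ExLlamaConfig

model_config_path = "/models/llama-7b/config.json"         # hypothetical paths
model_path = "/models/llama-7b/llama-7b-4bit.safetensors"

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
# A LoRA trained on longer sequences still needs the base model's RoPE
# embeddings stretched to match, otherwise quality degrades past 2048 tokens.
config.max_seq_len = 4096
config.compress_pos_emb = 2.0   # linear RoPE scaling factor

model = ExLlama(config)
```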
transformers merged https://github.com/huggingface/transformers/pull/24653, which covers only dynamic NTK RoPE scaling (NTKv2); it would be nice to have it in exllama as well.
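Roughly the scheme from the linked PR, sketched below; parameter names and defaults are illustrative, not exllama's or transformers' exact API:

```python
import torch

def dynamic_ntk_inv_freq(dim: int, seq_len: int, max_position_embeddings: int = 2048,
                         base: float = 10000.0, scaling_factor: float = 2.0) -> torch.Tensor:
    # Once the running sequence length exceeds the trained context, grow the
    # RoPE base so the rotary frequencies cover the longer context; below the
    # trained length the original base is kept unchanged.
    if seq_len > max_position_embeddings:
        base = base * (
            (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
```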
Add a truncation warning, as it can be rough to discover truncation through trial and error or by noticing the context numbers.
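A small sketch of what such a warning could look like; the helper and its parameters are hypothetical, not existing exllama code:

```python
import warnings

def warn_if_truncated(num_prompt_tokens: int, max_new_tokens: int, max_seq_len: int) -> None:
    # Make truncation visible instead of silently dropping the oldest tokens.
    if num_prompt_tokens + max_new_tokens > max_seq_len:
        warnings.warn(
            f"Prompt ({num_prompt_tokens} tokens) plus {max_new_tokens} new tokens "
            f"exceeds max_seq_len={max_seq_len}; older tokens will be dropped."
        )
```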