exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
Hello, `python test_benchmark_inference.py -d ./models -p -ppl` throws:
```
/bin/sh: 1: /usr/bin/nvcc: not found
ninja: build stopped: subcommand failed.
```
but nvcc (Build cuda_11.8.r11.8) is installed:
```
which nvcc
/usr/local/cuda/bin/nvcc...
```
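For anyone hitting the same thing: PyTorch's extension builder resolves `nvcc` through the `CUDA_HOME` environment variable rather than `PATH`, so a stale or unset `CUDA_HOME` can produce exactly this error even when `nvcc` works from the shell. A minimal sketch of the usual workaround, assuming CUDA lives under `/usr/local/cuda` as the `which nvcc` output above suggests:
```
import os

# Point the JIT build at the real CUDA install before the extension is built;
# "/usr/bin/nvcc" in the error suggests CUDA_HOME resolved to the wrong prefix.
os.environ["CUDA_HOME"] = "/usr/local/cuda"
os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ["PATH"]

import torch  # import torch (and the extension) only after fixing the env
```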
Hello. Thanks for this amazing work. How do I run multi-GPU inference from IPython rather than the WebUI? At present, I am implementing it this way. It is a...
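A minimal sketch of how this can look outside the WebUI, based on the repo's example scripts (paths are placeholders; the string passed to `set_auto_map` is the per-GPU VRAM allocation in GB, which splits the layers across devices):
```
import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/model"  # placeholder
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]
config.set_auto_map("16,24")        # split across two GPUs: 16 GB + 24 GB

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_directory, "tokenizer.model"))
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)
print(generator.generate_simple("Hello,", max_new_tokens=64))
```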
First of all, thanks a lot for this great project! I got a weird issue when generating with Llama 2 at 4096 context using `generator.generate_simple`:
```
File "/codebase/research/exllama/model.py", line 556,...
```
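For what it's worth, `generate_simple` appends up to `max_new_tokens` on top of the tokenized prompt, so prompt length plus new tokens has to stay within `config.max_seq_len` (which defaults to 2048). A hedged sketch of the usual fix, assuming a model trained for 4096 tokens:
```
config.max_seq_len = 4096   # raise the window to the model's trained length

prompt_ids = tokenizer.encode(prompt)
budget = config.max_seq_len - prompt_ids.shape[-1]   # room left for generation
output = generator.generate_simple(prompt, max_new_tokens=min(200, budget))
```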
Configuration: AMD W7900 + ROCm 5.6. Running the model on oobabooga/text-generation-webui, GPU memory stays allocated even after unloading the model. Model: TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True. Running meta-llama/Llama-2-7b-chat-hf without quantization does not have this issue. Is...
Hi, thanks for the cool project. I am testing [Llama-2-70B-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ) on 1 × A100 40G, and the speed is around 9 t/s. Is this the expected speed? I noticed in some...
Was wondering what the current progress is on the rewrite, and whether this could be turned into some sort of tracker for it? Optimizations for the P40 seem to be...
As of now, there is no way to modify RoPE Frequency Base and RoPE Frequency Scale. We would need to edit `rope.cu` to support parameters for frequency and scale: https://github.com/turboderp/exllama/blob/21f4a12be5794692f66410ad4fb78ffaad508d00/exllama_ext/cuda_func/rope.cu#L21-L31...
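For context, a sketch of where those two knobs would enter the math (standard RoPE formulation; the names `freq_base` and `freq_scale` are illustrative, not existing exllama parameters):
```
import torch

def rope_angles(positions, head_dim, freq_base=10000.0, freq_scale=1.0):
    # Per-pair inverse frequencies: inv_freq[i] = freq_base ** (-2i / head_dim)
    inv_freq = 1.0 / (freq_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # A scale < 1.0 compresses positions (linear / "compress_pos_emb" scaling),
    # stretching the usable context at the cost of positional resolution.
    return torch.outer(positions.float() * freq_scale, inv_freq)
```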
Any thoughts/plans about YaRN support for the positional embeddings? https://github.com/jquesnelle/yarn I don't actually see it beat regular linear scaling with fine-tuning in the paper, but presumably it extends beyond the...
Hi, CodeLlama is not really working on exllama; the answers are sometimes complete gibberish. Can you please update the library to support the new `rope_theta` parameter of CodeLlama?...
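In case it's useful in the meantime: CodeLlama's `config.json` sets `rope_theta` to 1e6 instead of the usual 10000, and keeping the old base is exactly the kind of thing that produces gibberish. A hedged sketch of overriding it by hand (attribute name per exllama's `ExLlamaConfig`; verify against your checkout):
```
from model import ExLlamaConfig

config = ExLlamaConfig("/path/to/codellama/config.json")  # placeholder path
config.rotary_embedding_base = 1000000.0   # CodeLlama's rope_theta
config.max_seq_len = 16384                 # CodeLlama's long-context window
```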
The following sometimes happens while completion is ongoing for large context sizes.
- My context size was: 3,262
- The max_new_tokens set: 4,096
```
Traceback (most recent call last):...
```
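This looks like the prompt plus `max_new_tokens` overrunning the context window (3,262 + 4,096 is well past a 4,096-token `max_seq_len`). A hedged caller-side guard (hypothetical helper, not part of the library):
```
def clamp_new_tokens(prompt_len, max_new_tokens, max_seq_len):
    # Never request more new tokens than the remaining context window.
    return max(0, min(max_new_tokens, max_seq_len - prompt_len))

# With the numbers from this report and a 4096-token window:
# clamp_new_tokens(3262, 4096, 4096) -> 834
```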