exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
Hello, `python test_benchmark_inference.py -d ./models -p -ppl` throws:
```
/bin/sh: 1: /usr/bin/nvcc: not found
ninja: build stopped: subcommand failed.
```
but nvcc (Build cuda_11.8.r11.8) is installed:
```
which nvcc
/usr/local/cuda/bin/nvcc...
```
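For anyone hitting the same thing: PyTorch's extension builder resolves `nvcc` through the `CUDA_HOME` environment variable rather than `PATH`, so a stale or unset `CUDA_HOME` can produce exactly this error even when `nvcc` works from the shell. A minimal sketch of the usual workaround, assuming CUDA lives under `/usr/local/cuda` as the `which nvcc` output above suggests:
```
import os

# Point the JIT build at the real CUDA install before the extension is built;
# "/usr/bin/nvcc" in the error suggests CUDA_HOME resolved to the wrong prefix.
os.environ["CUDA_HOME"] = "/usr/local/cuda"
os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ["PATH"]

import torch  # import torch (and the extension) only after fixing the env
```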
Hello. Thanks for this amazing work. How do I run multi-GPU inference from IPython rather than the WebUI? At present, I am implementing it this way. It is a...
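A minimal sketch of how this can look outside the WebUI, based on the repo's example scripts (paths are placeholders; the string passed to `set_auto_map` is the per-GPU VRAM allocation in GB, which splits the layers across devices):
```
import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/model"  # placeholder
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]
config.set_auto_map("16,24")        # split across two GPUs: 16 GB + 24 GB

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_directory, "tokenizer.model"))
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)
print(generator.generate_simple("Hello,", max_new_tokens=64))
```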
First of all, thanks a lot for this great project! I got a weird issue when generating with Llama 2 at 4096 context using `generator.generate_simple`:
```
File "/codebase/research/exllama/model.py", line 556,...
```
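For what it's worth, `generate_simple` appends up to `max_new_tokens` on top of the tokenized prompt, so prompt length plus new tokens has to stay within `config.max_seq_len` (which defaults to 2048). A hedged sketch of the usual fix, assuming a model trained for 4096 tokens:
```
config.max_seq_len = 4096   # raise the window to the model's trained length

prompt_ids = tokenizer.encode(prompt)
budget = config.max_seq_len - prompt_ids.shape[-1]   # room left for generation
output = generator.generate_simple(prompt, max_new_tokens=min(200, budget))
```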
Configuration: AMD W7900 + ROCm 5.6. Running the model on oobabooga/text-generation-webui, GPU memory stays allocated even after unloading the model. Model: TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True. Running meta-llama/Llama-2-7b-chat-hf without quantization does not have this issue. Is...
Hi, thanks for the cool project. I am testing [Llama-2-70B-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ) on 1 × A100 40G, and the speed is around 9 t/s. Is this the expected speed? I noticed in some...
Was wondering what the current progress is on the rewrite, and whether this could be turned into some sort of tracker for it? Optimizations for the P40 seem to be...
As of now, there is no way to modify RoPE Frequency Base and RoPE Frequency Scale. We would need to edit `rope.cu` to support parameters for frequency and scale: https://github.com/turboderp/exllama/blob/21f4a12be5794692f66410ad4fb78ffaad508d00/exllama_ext/cuda_func/rope.cu#L21-L31...
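For context, a sketch of where those two knobs would enter the math (standard RoPE formulation; the names `freq_base` and `freq_scale` are illustrative, not existing exllama parameters):
```
import torch

def rope_angles(positions, head_dim, freq_base=10000.0, freq_scale=1.0):
    # Per-pair inverse frequencies: inv_freq[i] = freq_base ** (-2i / head_dim)
    inv_freq = 1.0 / (freq_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # A scale < 1.0 compresses positions (linear / "compress_pos_emb" scaling),
    # stretching the usable context at the cost of positional resolution.
    return torch.outer(positions.float() * freq_scale, inv_freq)
```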
Any thoughts/plans about YaRN support for the positional embeddings? https://github.com/jquesnelle/yarn I don't actually see it beat regular linear scaling with fine-tuning in the paper, but presumably it extends beyond the...
Hi, CodeLlama is not really working on exllama; the answers are sometimes complete gibberish. Can you please update the library to support the new `rope_theta` parameter of CodeLlama?...
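In case it's useful in the meantime: CodeLlama's `config.json` sets `rope_theta` to 1e6 instead of the usual 10000, and keeping the old base is exactly the kind of thing that produces gibberish. A hedged sketch of overriding it by hand (attribute name per exllama's `ExLlamaConfig`; verify against your checkout):
```
from model import ExLlamaConfig

config = ExLlamaConfig("/path/to/codellama/config.json")  # placeholder path
config.rotary_embedding_base = 1000000.0   # CodeLlama's rope_theta
config.max_seq_len = 16384                 # CodeLlama's long-context window
```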
The following sometimes happens while completion is ongoing for large context sizes.
- My context size was: 3,262
- The max_new_tokens set: 4,096
```
Traceback (most recent call last):...
```
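This looks like the prompt plus `max_new_tokens` overrunning the context window (3,262 + 4,096 is well past a 4,096-token `max_seq_len`). A hedged caller-side guard (hypothetical helper, not part of the library):
```
def clamp_new_tokens(prompt_len, max_new_tokens, max_seq_len):
    # Never request more new tokens than the remaining context window.
    return max(0, min(max_new_tokens, max_seq_len - prompt_len))

# With the numbers from this report and a 4096-token window:
# clamp_new_tokens(3262, 4096, 4096) -> 834
```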