
Exllama tutorials?

Open NickDatLe opened this issue 2 years ago • 23 comments

I'm new to exllama. Are there any tutorials on how to use it? I'm trying it with the Llama 2 70B model.

NickDatLe avatar Jul 25 '23 03:07 NickDatLe

There is no specific tutorial, but here is how to set it up and get it running! (Note: the 70B model needs at least 42 GB of VRAM, so only a single A6000 / 6000 Ada or two 3090s/4090s can run the model; see the README for speed stats on a mixture of GPUs.)
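
As a rough sanity check on that figure (back-of-the-envelope numbers, not exact measurements): a 4-bit GPTQ model stores about 4 bits per weight (slightly more when small group sizes add scales/zeros), so the weights of a 70B model alone come to roughly 33-37 GiB, and the KV cache plus activations push the total toward the ~42 GB mentioned above.

# rough VRAM estimate for a 4-bit GPTQ 70B model (assumed figures, for illustration only)
params = 70e9
bits_per_weight = 4.0   # add ~0.15-0.65 bits for group scales/zeros, depending on group size
weights_gib = params * bits_per_weight / 8 / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB, before KV cache and activations")   # ≈ 32.6 GiB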

To begin with, install conda (e.g. via the Miniconda install script) and then create a new conda environment (so that pip packages don't mix with those of other Python projects):

conda create -n exllama python=3.10
# after that
conda activate exllama

Then, clone the repo

git clone https://github.com/turboderp/exllama
cd exllama

# while conda is activated
pip install -r requirements.txt

Next, download a GPTQ-quantized model. TheBloke provides lots of them, and they all work:

# if you don't have git lfs installed: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

You're all set. Now, the only thing left is running a test benchmark and finally running the chatbot example.

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
# add -gs 16.2,24 when running models that require more VRAM than one GPU can supply

If that is successful, run this and enjoy a chatbot:

python example_chatbot.py -d ../path/to/Llama-2-70B-chat-GPTQ/ -un NickDatLe -bn ChadGPT -p prompt_chatbort.txt -nnl
# -nnl makes it so that the bot can output more than one line

Et voilà. Edit prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format in example_chatbot_llama2chat.py and will open a PR soon.
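
For reference, the Llama 2 chat models expect the [INST] / <<SYS>> template rather than the plain persona prompt used by example_chatbot.py. Here is a minimal sketch of building a single-turn prompt in that format (the helper name build_llama2_prompt is made up for illustration):

# sketch: single-turn Llama 2 chat prompt (illustrative helper, not part of exllama)
def build_llama2_prompt(system_prompt, user_message):
    return (f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]")

prompt = build_llama2_prompt("You are a helpful assistant.",
                             "Summarize what GPTQ quantization does.")
# the BOS token (<s>) is normally prepended by the tokenizer, not written by hand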

SinanAkkoyun avatar Jul 25 '23 09:07 SinanAkkoyun

Thank you for your help, Sinan! I followed your instructions and ran:

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/

I have 2x 4090 GPUs, but it's only using one of them as far as I can tell, so I'm getting a CUDA out-of-memory error:

(py311) nick@easyai:~/dev/exllama$ python test_benchmark_inference.py -p -ppl -d Llama-2-70B-chat-GPTQ
 -- Perplexity:
 -- - Dataset: datasets/wikitext2_val_sample.jsonl
 -- - Chunks: 100
 -- - Chunk size: 2048 -> 2048
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
 -- Model config: Llama-2-70B-chat-GPTQ/config.json
 -- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --sdp_thd: 8
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- Options: ['perf', 'perplexity']
Traceback (most recent call last):
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <module>
    model = timer("Load model", lambda: ExLlama(config))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 56, in timer
    ret = func()
          ^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <lambda>
    model = timer("Load model", lambda: ExLlama(config))
                                        ^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/model.py", line 831, in __init__
    tensor = tensor.to(device, non_blocking = True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 23.64 GiB total capacity; 23.23 GiB already allocated; 23.88 MiB free; 23.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

NickDatLe avatar Jul 25 '23 21:07 NickDatLe

You need to define how weights are to be split across the GPUs. There's a bit of trial and error in that, currently, since you're only supplying the maximum allocation for weights, not activations. And space needed for activations is a difficult function of exactly what layers end up on the device, so best you can do for now is just try some values and adjust based on which GPU ends up running out of memory first. The syntax is just -gs 16.2,24 to use up to 16.2 GB on the first device, then up to 24 GB on the second. I find that works pretty well on 70B, but YMMV especially with lower group sizes.
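
For anyone driving exllama from their own Python code instead of the bundled scripts, the same split can be set on the config before the model is constructed. A minimal sketch, assuming ExLlamaConfig.set_auto_map accepts the same comma-separated gigabyte list that -gs does (paths are placeholders):

# sketch: programmatic equivalent of -gs 16.2,24
from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("../path/to/Llama-2-70B-chat-GPTQ/config.json")
config.model_path = "../path/to/Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors"
config.set_auto_map("16.2,24")   # up to 16.2 GB of weights on GPU 0, up to 24 GB on GPU 1
model = ExLlama(config)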

turboderp avatar Jul 25 '23 22:07 turboderp

I didn't see the -gs flag. After setting it to 16.2,24 like you mentioned, it worked. Thank you!

 -- Perplexity:
 -- - Dataset: datasets/wikitext2_val_sample.jsonl
 -- - Chunks: 100
 -- - Chunk size: 2048 -> 2048
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
 -- Model config: Llama-2-70B-chat-GPTQ/config.json
 -- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --sdp_thd: 8
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- Options: ['gpu_split: 16.2,24', 'perf', 'perplexity']
 ** Time, Load model: 4.34 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 !! Model has empty group index (discarded)
 ** VRAM, Model: [cuda:0] 16,902.60 MB - [cuda:1] 17,390.74 MB
 ** VRAM, Cache: [cuda:0] 308.12 MB - [cuda:1] 320.00 MB
 -- Warmup pass 1...
 ** Time, Warmup: 1.33 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 0.81 seconds
 -- Inference, first pass.
 ** Time, Inference: 1.24 seconds
 ** Speed: 1545.28 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 17.07 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 21.84 tokens/second
 ** VRAM, Inference: [cuda:0] 317.16 MB - [cuda:1] 317.29 MB
 ** VRAM, Total: [cuda:0] 17,527.88 MB - [cuda:1] 18,028.04 MB
 -- Loading dataset...
 -- Testing 100 chunks..........
 ** Perplexity: 5.8741

NickDatLe avatar Jul 25 '23 22:07 NickDatLe

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now

SinanAkkoyun avatar Jul 25 '23 22:07 SinanAkkoyun

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now

All good, working now; I'm going to learn exllama more. Fascinating stuff!

NickDatLe avatar Jul 25 '23 22:07 NickDatLe

If it's OK with the mods, I'm going to leave this thread open in case someone posts a tutorial or has some great links about exllama.

NickDatLe avatar Jul 25 '23 23:07 NickDatLe

@SinanAkkoyun do you know what folks in the LLM community are using to communicate? Discord? Slack?

NickDatLe avatar Jul 26 '23 04:07 NickDatLe

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

SinanAkkoyun avatar Jul 26 '23 08:07 SinanAkkoyun

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

Ah, OK, I will join some Discord servers. It seems "TheBloke" has a server, and they are very popular on the LLM leaderboard.

NickDatLe avatar Jul 27 '23 23:07 NickDatLe

Where is this? Invite me!

Edit: Never mind I found it.

turboderp avatar Jul 28 '23 01:07 turboderp

@SinanAkkoyun Once the test_benchmark_inference.py script has finished successfully, is there an easy way to get the 70B chatbot running in a Jupyter notebook?

Edit: For posterity, it was relatively straightforward to get this working in a notebook environment by adapting code from the example_basic.py file.
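
For anyone else trying this, here is a minimal sketch along the lines of example_basic.py (run it from inside the exllama checkout so the imports resolve; the paths, sampling settings, and the two-GPU split are placeholders/assumptions):

# sketch adapted from example_basic.py for a notebook cell (paths are placeholders)
import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "../path/to/Llama-2-70B-chat-GPTQ/"
config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]
config.set_auto_map("16.2,24")                         # split weights across two 24 GB GPUs, like -gs

model = ExLlama(config)                                # load the quantized weights
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)                            # KV cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.7
generator.settings.top_p = 0.9
print(generator.generate_simple("Hello, my name is", max_new_tokens = 50))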

cmunna0052 avatar Jul 31 '23 19:07 cmunna0052

Et voila. Edit the prompt_chatbort.txt inside the exllama repo as you like. Keep in mind, the Llama 2 chat format is different than the one the example provides, I am working on implementing the real prompt into example_chatbot_llama2chat.py and will do a PR soon.

Can you share "example_chatbot_llama2chat.py" if it is possible?

pourfard avatar Aug 03 '23 18:08 pourfard

@pourfard a PR is incoming today, I will implement it

SinanAkkoyun avatar Aug 04 '23 10:08 SinanAkkoyun

@pourfard https://github.com/turboderp/exllama/pull/221

:) Either wait for the PR to be merged or copy the new file example_llama2chat.py directly into your Exllama directory. (Keep in mind, you need the latest version of Exllama)

SinanAkkoyun avatar Aug 04 '23 13:08 SinanAkkoyun

First of all, thank you. Exllama is working for me while others do not...

I did some testing of a number of models with the Sally riddle: https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md [and here's my setup, in case it's of benefit to other people: https://github.com/nktice/AMD-AI]. I did this by hand through the Oobabooga UI, and it took me a while.

I'd like commands to run exllama from shell scripts (such as bash). So I went looking and was disappointed by the Python files there; I had hoped they would respond with help info from the command line, where "--help" or "-h" could return the parameters and the program's purpose. [As this thread is about documentation issues, it seems like this would help.]

nktice avatar Aug 12 '23 14:08 nktice

send invite/link please!

NickDatLe avatar Aug 17 '23 23:08 NickDatLe

https://discord.gg/theblokeai

Really awesome

SinanAkkoyun avatar Aug 18 '23 03:08 SinanAkkoyun

https://discord.gg/theblokeai

Really awesome

Add me! nickdle

NickDatLe avatar Aug 25 '23 04:08 NickDatLe

Add me! nickdle

You need to join by clicking the link :)

SinanAkkoyun avatar Aug 25 '23 17:08 SinanAkkoyun

I invited you as a friend :)

NickDatLe avatar Aug 27 '23 21:08 NickDatLe

@NickDatLe Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

SinanAkkoyun avatar Aug 28 '23 04:08 SinanAkkoyun

@NickDatLe Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

nickdle, I sent a friend request.

NickDatLe avatar Aug 29 '23 15:08 NickDatLe