text-generation-webui
Add LLaMA to Colab
LLaMA runs in Colab just fine, including in 8bit. If the Colab is updated to include LLaMA, lots more people can experience LLaMA without needing to configure things locally.
Here's how I updated the Colab for LLaMA and how it could be updated by the maintainer going forward.
In Step 3 "Launch"
on Line 6 (added LLaMA entries, reduced pyg 6B to a single entry, default to LLaMA-13B):
model = "LLaMA-13B" #@param ["LLaMA-7B", "LLaMA-13B", "LLaMA-30B (Colab Pro)", "LLaMA-65B (Colab Pro)","Pygmalion 6B","Pygmalion 350m (for debugging)"] {allow-input: false}
on line 9 (default to True):
load_in_8bit = True #@param {type:"boolean"}
from line 17 (replaced models with LLaMA models, a single Pyg 6B, and Pyg 350m for debugging):
# Data: display name -> (Hugging Face organization, repository, branch, local folder name)
models = {
"LLaMA-7B": ("decapoda-research", "llama-7b-hf", "main","llama-7b-hf"),
"LLaMA-13B": ("decapoda-research", "llama-13b-hf", "main","llama-13b-hf"),
"LLaMA-30B (Colab Pro)": ("decapoda-research", "llama-30b-hf", "main","llama-30b-hf"),
"LLaMA-65B (Colab Pro)": ("decapoda-research", "llama-65b-hf", "main","llama-65b-hf"),
"Pygmalion 6B": ("waifu-workshop", "pygmalion-6b", "sharded", "pygmalion-6b_sharded"),
"Pygmalion 350m (for debugging)": ("PygmalionAI", "pygmalion-350m", "main", "pygmalion-350m"),
}
That's it. The Colab already works with LLaMA. It just needed to be pointed at where to find the models on HuggingFace.
Good Job! I also want to try running on Colab Pro
You can make the changes above to the Colab linked in the readme. https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb
Just click "Show Code" in section 3 and then make the changes on the lines described in the OP.
You should be able to run as large as LLaMA-30B in 8bit with Colab Pro. (Note: LLaMA-13B ran at 0.6it/s. So 30B may be quite slow in Colab.)
LLaMA-65B 4bit should also work in Colab Pro, but 4bit requires a few more setup steps that are not in my post above.
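For anyone curious, here is a rough sketch of those extra 4-bit steps, adapted for 65B from the 13B commands shared further down this thread. The decapoda-research llama-65b-hf-int4 repo and the llama-65b-4bit.pt filename are assumptions by analogy with the 13B checkpoint, so verify they exist before relying on this:
%cd /content/text-generation-webui
!mkdir repositories
%cd repositories
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!python setup_cuda.py install
%cd /content/text-generation-webui
# Download only the tokenizer/config files, then fetch the preconverted 4-bit weights
!python download-model.py --text-only decapoda-research/llama-65b-hf
# Assumed URL, by analogy with the 13B int4 repo used later in this thread
!wget https://huggingface.co/decapoda-research/llama-65b-hf-int4/resolve/main/llama-65b-4bit.pt -P models/llama-65b-hf/
!python server.py --share --model llama-65b-hf --gptq-bits 4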
I agree that it would be nice to have LLaMA in the colab notebook. In my experience, LLaMA surpasses and replaces Pygmalion.
I don't want to include an unlicensed model in the notebook though, so I will wait for Meta to properly release it on Hugging Face.
If I wanted to try and test this, can I use the same settings in the Colab that I would if I was running Pygmalion, or does anything need to be changed?
Nothing needs to be changed.
Use load_in_8bit depending on how much VRAM your instance has and what size model you use. Refer to the table below.
Model | 16bit VRAM Requirement | 8bit VRAM Requirement | 4bit VRAM Requirement |
---|---|---|---|
LLaMA-7B | 20GB | 10GB | 6GB** |
LLaMA-13B | 40GB | 16GB | 10GB** |
LLaMA-30B | 80GB | 32GB | 20GB** |
LLaMA-65B | 160GB | 80GB | 40GB** |
*RAM is only required to load the model, not to run it. Swap space can be used if you do not have enough RAM.
**4bit VRAM requirements will go down substantially as optimizations like flash attention are implemented in GPTQ-for-LLaMa.
(Note: 16bit is faster than 8bit. Only use 8bit if it is needed to fit the model in VRAM.)
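If you are not sure how much VRAM your Colab instance actually has, here is a quick check before picking a model size from the table (a sketch, assuming a standard Colab runtime with PyTorch preinstalled):
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Report the GPU name and total VRAM so you can pick a model size from the table above
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No GPU detected - check Runtime > Change runtime type")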
Ahh. I was trying to run the 13B model on a non-Pro Colab, so that's why it busted. Thanks for the info!
I actually did get 13B to run in a free colab despite what the requirements table above says. It seems there are small efficiency improvements being made every day, allowing the model to fit in smaller amounts of VRAM.
With so little free VRAM you may have to keep the context window small (a few hundred tokens) or refresh and hope to get a bigger GPU.
I was able to run it; I just ran out of memory when I tried to generate. Probably the token size, like you said. I'll try again tomorrow, but the 7B wasn't that bad. There were a few hiccups, but I'd say it was somewhat better than Pygmalion at using the information given to it. The only things it seemed to struggle with were sticking to one perspective (it would mix third person and POV instead of strictly following what I was using), and it would sometimes get stuck using up the entire generation instead of finishing the sentence, but that's not a huge deal.
It works great, but I have problems when I try to get it to write code: the code is not displayed, and sometimes it even breaks the display of the chat box. It looks like a text escaping problem.
Hello, I modded the Colab and it worked great. I was trying the 30B version, but it took more than twenty minutes just to download the model, and I was using a premium GPU runtime.
I tried to !cp the files to my GDrive when I was done, but the Colab ran out of disk and I missed 10 of the shards(?). GDrive itself has space, as I have 2TB. Even if it had worked, I'm not sure I would have figured out how to load it next time.
Is there any way I can download directly to a mounted GDrive and load the model from it later, instead of redownloading each time?
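This wasn't answered in the thread, but one way to do it would be to mount Drive and point the webui's models directory at it. A minimal sketch, assuming a Drive folder called text-gen-models (a made-up name):
from google.colab import drive
drive.mount('/content/drive')

# Keep models on Drive so they persist between sessions (folder name is hypothetical)
!mkdir -p /content/drive/MyDrive/text-gen-models

# Replace the local models directory with a symlink to the Drive folder,
# so downloads land on Drive and can be loaded again next session
!rm -rf /content/text-generation-webui/models
!ln -s /content/drive/MyDrive/text-gen-models /content/text-generation-webui/models
Loading multi-gigabyte shards over the Drive mount can be slow, but it avoids redownloading the model every session.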
@NoShinSekai it should work now if you instruct the bot to use markdown.
https://github.com/oobabooga/text-generation-webui/pull/266
Hi @oobabooga,
I saw a post on Twitter: LLAMA 7B "4bit" model working on Colab
Can you help me implement LLAMA 7B "4bit" on Colab? Thank you very much!
You pretty much just need to copy and paste the commands here
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
And add another command to download the 4-bit preconverted model from decapoda.
Yes, many thanks!
The colab notebook isn't working for me - instead of printing a Gradio URL the colab cell just finishes with the following output:
python server.py --share --gptq-bits 4 --model llama-13b-hf
Loading llama-13b-hf...
Loading model ...
^C
This is the code:
!pip uninstall transformers
!pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176
%cd /content
!git clone https://github.com/oobabooga/text-generation-webui
!mkdir text-generation-webui/logs
!ln -s text-generation-webui/logs .
!ln -s text-generation-webui/characters .
!ln -s text-generation-webui/models .
%rm -r sample_data
%cd text-generation-webui
!pip install -r requirements.txt
!mkdir repositories
%cd repositories
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!pip install -r requirements.txt
!python setup_cuda.py install
%cd ../..
model = "llama-13b" #@param ["llama-13b", "llama-7b"]
model_dir = model + "-hf"
!python download-model.py --text-only decapoda-research/$model_dir
%cd models/$model_dir
!wget https://huggingface.co/decapoda-research/$model-hf-int4/resolve/main/$model-4bit.pt
%cd ../..
import json
cmd = f"python server.py --share --gptq-bits 4 --model {model_dir}"
print(cmd)
!$cmd
What's going wrong? Everything works fine if I load in 8 bit.
@oobabooga: Thanks for the tips on Markdown formatting! The new UI is even better!
@generatorman: I'm using the following code to run the 13B model in 4bit (I didn't install the LLaMA transformers branch, but it works great like this):
Step 1 : install GPTQ-for-LLaMa
%cd /content/text-generation-webui/
!mkdir repositories
%cd /content/text-generation-webui/repositories/
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!python setup_cuda.py install
Step 2 : Download the llama-13b model and launch the Web UI
%cd /content/text-generation-webui/
!python download-model.py --text-only decapoda-research/llama-13b-hf
!wget https://huggingface.co/decapoda-research/llama-13b-hf-int4/resolve/main/llama-13b-4bit.pt -P /content/models/llama-13b-hf/
!python server.py --share --model llama-13b-hf --gptq-bits 4
Doing the same thing as previously (changing the lines from the first post), I now get this error. Any idea what to do?
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2h1afqm5qj3gi --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100% 33/33 [01:07<00:00, 2.05s/it]
Traceback (most recent call last):
File "/content/text-generation-webui/server.py", line 215, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/content/text-generation-webui/modules/models.py", line 158, in load_model
tokenizer = AutoTokenizer.from_pretrained(Path(f"models/{shared.model_name}/"))
File "/usr/local/lib/python3.9/dist-packages/transformers/models/auto/tokenization_auto.py", line 677, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
https://github.com/oobabooga/text-generation-webui/commit/23a5e886e1aa6849e0819256c3bb4b2bf7d8358e
This is why.
@Enferlain You need to edit the file tokenizer_config.json in /models/llama-7b-hf/ (or whatever model you are using) and change the string "LLaMATokenizer" to "LlamaTokenizer". I got the same error a few hours ago.
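For example, a one-liner to do that rename in a Colab cell (the path assumes the llama-13b-hf folder from the cells above; adjust it for your model):
# Fix the tokenizer class name in the downloaded config
!sed -i 's/LLaMATokenizer/LlamaTokenizer/' /content/text-generation-webui/models/llama-13b-hf/tokenizer_config.json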
Thanks, that works. Oh, and I wanted to try the 13B model with the instructions you wrote, but I only have 12.7GB of RAM on Colab, which seems to bust at the loading part.
You need Colab Pro to use the high-RAM environment when loading the model; it takes something like 15~16GB of RAM.
13B in 8bit loaded fine for me without Pro and never used more than 3GB of RAM during loading.
The VRAM could not fit the full 2048 context, but it loaded and ran fine.
Repository Not Found for url: https://huggingface.co/models/llama-7b-hf/resolve/main/config.json
Did they change something 🤔
@NoShinSekai
Any idea why I get this error when I try to load 4bit models?
Loading llama-13b-hf-int4...
Could not find the quantized model in .pt or .safetensors format, exiting...
I tried 3 different ones
https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I think the original 4bit models aren't working anymore; this thread suggests grabbing them from the torrents they provide.
Also, GPTQ is needed for 4bit, and the main repo isn't working right at the moment and was replaced with oobabooga's fork: "git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda"
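In Colab terms, that means swapping the clone in the earlier GPTQ step for the fork, roughly like this (a sketch, assuming the same repositories layout used above):
%cd /content/text-generation-webui/repositories
!rm -rf GPTQ-for-LLaMa
# Use oobabooga's fork of GPTQ-for-LLaMa (cuda branch) instead of the main repo
!git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
%cd GPTQ-for-LLaMa
!python setup_cuda.py install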
I tried again and now I'm not getting that error; I must have missed something. I'm struggling to load a model in, though.
This is the setup
I tried these models so far:
- https://huggingface.co/wcde/llama-13b-4bit-gr128 - system RAM spiked up to 11-12GB and it died with ^C
- https://huggingface.co/elinas/alpaca-13b-lora-int4 - this one works pretty well
- https://huggingface.co/decapoda-research/llama-13b-hf-int4 - errors because it doesn't have a config.json
I might try uploading some other models through Drive or look around on Hugging Face for some more; thanks for the comment, since it made me try again.
On another note, Gradio is being so slow, holy shit.
Gradio 3.24.0 seems to be really unstable on Colab for some reason. You can try adding "pip install gradio==3.18.0" after "pip install -r requirements.txt".
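In the notebook's install cell, that would look roughly like this (a sketch; the exact version pin may need adjusting):
%cd text-generation-webui
!pip install -r requirements.txt
# Pin Gradio to an older release that is more stable on Colab
!pip install gradio==3.18.0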
https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb
I just ran it on Colab and it caused an error at the end:
OSError: models/pygmalion-6b_original-sharded is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'. If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token, or log in with huggingface-cli login and pass use_auth_token=True.
Maybe it's because I didn't put the model file under the models folder?
Should I then go to the link below (since I want to use the Pygmalion 6B model): https://huggingface.co/PygmalionAI/pygmalion-6b/tree/main
and download the files and put the .bin file in models? I don't know which files I should put there.
I encountered the same issue and dug into the error and the source code. Before the OSError, there was a KeyError in text-generation-webui/download-model.py. I opened the file in Colab and did the following:
- removed the while loop on line 106
- removed the if statement on line 111
- commented out the three "cursor =" lines starting from line 140
- unindented the nested code
- replaced line 106 (the "content =" line) with: content = requests.get(f"https://huggingface.co/api/models/decapoda-research/llama-7b-hf/tree/main").content
- renamed the dict variable to something else (maybe dictt)
And that's it, this should work and the LLaMA model should get downloaded in the models folder.
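For context, here is a rough, self-contained sketch of what that simplified download logic boils down to: list the repo's files through the Hugging Face tree API and download each one into the models folder. The response fields ("type", "path") and the flat-repo assumption are mine, not the exact code from download-model.py:
import os
import requests

repo = "decapoda-research/llama-7b-hf"
target_dir = f"models/{repo.split('/')[1]}"
os.makedirs(target_dir, exist_ok=True)

# List the repo's files via the Hugging Face tree API
# (assumed response shape: a JSON list of objects with "type" and "path" fields)
files = requests.get(f"https://huggingface.co/api/models/{repo}/tree/main").json()

for entry in files:
    if entry.get("type") != "file":
        continue
    name = entry["path"]
    url = f"https://huggingface.co/{repo}/resolve/main/{name}"
    print(f"Downloading {name}...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(os.path.join(target_dir, name), "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)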
Traceback (most recent call last):
File "/content/text-generation-webui/download-model.py", line 169, in
How do I configure it to run a "sharded" version of Mistral-7B so it can use a free T4 in Colab? For example: https://huggingface.co/bn22/Mistral-7B-Instruct-v0.1-sharded/tree/main
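Not answered in this thread, but following the same models dict pattern from the first post, you could try adding an entry like the one below (the display name and local folder name are my own guesses, and the notebook's transformers version would need to be new enough to support Mistral):
models = {
    # ... existing entries from the first post ...
    "Mistral-7B-Instruct (sharded)": ("bn22", "Mistral-7B-Instruct-v0.1-sharded", "main", "Mistral-7B-Instruct-v0.1-sharded"),
}
You would also add the same display name to the #@param list on Line 6 so it shows up in the dropdown.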