text-generation-webui
Add LLaMA to Colab
LLaMA runs in Colab just fine, including in 8bit. If the Colab is updated to include LLaMA, lots more people can experience LLaMA without needing to configure things locally.
Here's how I updated the Colab for LLaMA and how it could be updated by the maintainer going forward.
In Step 3 "Launch"
on Line 6 (added LLaMA entries, reduced pyg 6B to a single entry, default to LLaMA-13B):
model = "LLaMA-13B" #@param ["LLaMA-7B", "LLaMA-13B", "LLaMA-30B (Colab Pro)", "LLaMA-65B (Colab Pro)","Pygmalion 6B","Pygmalion 350m (for debugging)"] {allow-input: false}
on line 9 (default to True):
load_in_8bit = True #@param {type:"boolean"}
from line 17 (replaced models with LLaMA models, a single Pyg 6B, and Pyg 350m for debugging):
# Data: display name -> (Hugging Face organization, repository, branch, local folder name)
models = {
"LLaMA-7B": ("decapoda-research", "llama-7b-hf", "main","llama-7b-hf"),
"LLaMA-13B": ("decapoda-research", "llama-13b-hf", "main","llama-13b-hf"),
"LLaMA-30B (Colab Pro)": ("decapoda-research", "llama-30b-hf", "main","llama-30b-hf"),
"LLaMA-65B (Colab Pro)": ("decapoda-research", "llama-65b-hf", "main","llama-65b-hf"),
"Pygmalion 6B": ("waifu-workshop", "pygmalion-6b", "sharded", "pygmalion-6b_sharded"),
"Pygmalion 350m (for debugging)": ("PygmalionAI", "pygmalion-350m", "main", "pygmalion-350m"),
}
That's it. The Colab already works with LLaMA. It just needed to be pointed at where to find the models on HuggingFace.
Good Job! I also want to try running on Colab Pro
You can make the changes above to the Colab linked in the readme. https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb
Just click "Show Code" in section 3 and then make the changes on the lines described in the OP.
You should be able to run as large as LLaMA-30B in 8bit with Colab Pro. (Note: LLaMA-13B ran at 0.6it/s. So 30B may be quite slow in Colab.)
LLaMA-65B 4bit should also work in Colab Pro, but 4bit requires a few more setup steps that are not in my post above.
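For anyone curious, here is a rough sketch of those extra 4-bit steps, adapted for 65B from the 13B commands shared further down this thread. The decapoda-research llama-65b-hf-int4 repo and the llama-65b-4bit.pt filename are assumptions by analogy with the 13B checkpoint, so verify they exist before relying on this:
%cd /content/text-generation-webui
!mkdir repositories
%cd repositories
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!python setup_cuda.py install
%cd /content/text-generation-webui
# Download only the tokenizer/config files, then fetch the preconverted 4-bit weights
!python download-model.py --text-only decapoda-research/llama-65b-hf
# Assumed URL, by analogy with the 13B int4 repo used later in this thread
!wget https://huggingface.co/decapoda-research/llama-65b-hf-int4/resolve/main/llama-65b-4bit.pt -P models/llama-65b-hf/
!python server.py --share --model llama-65b-hf --gptq-bits 4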
I agree that it would be nice to have LLaMA in the colab notebook. In my experience, LLaMA surpasses and replaces Pygmalion.
I don't want to include an unlicensed model in the notebook though, so I will wait for Meta to properly release it on Hugging Face.
If I wanted to try and test this, can I use the same settings in the Colab that I would if I was running Pygmalion, or does anything need to be changed?
Nothing needs to be changed.
Use load_in_8bit depending on how much VRAM your instance has and what size model you use. Refer to the table below.
Model | 16bit VRAM Requirement | 8bit VRAM Requirement | 4bit VRAM Requirement |
---|---|---|---|
LLaMA-7B | 20GB | 10GB | 6GB** |
LLaMA-13B | 40GB | 16GB | 10GB** |
LLaMA-30B | 80GB | 32GB | 20GB** |
LLaMA-65B | 160GB | 80GB | 40GB** |
*RAM is only required to load the model, not to run it. Swap space can be used if you do not have enough RAM.
**4bit VRAM requirements will go down substantially as optimizations like flash attention are implemented in GPTQ-for-LLaMa.
(Note: 16bit is faster than 8bit. Only use 8bit if it is needed to fit the model in VRAM.)
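If you are not sure how much VRAM your Colab instance actually has, here is a quick check before picking a model size from the table (a sketch, assuming a standard Colab runtime with PyTorch preinstalled):
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Report the GPU name and total VRAM so you can pick a model size from the table above
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No GPU detected - check Runtime > Change runtime type")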
Ahh. I was trying to run the 13B model on a non-Pro Colab, so that's why it busted. Thanks for the info!
I actually did get 13B to run in a free colab despite what the requirements table above says. It seems there are small efficiency improvements being made every day, allowing the model to fit in smaller amounts of VRAM.
With so little free VRAM you may have to keep the context window small (a few hundred tokens) or refresh and hope to get a bigger GPU.
I was able to run it; I just ran out of memory when I tried to generate. Probably the token size, like you said. I'll try again tomorrow, but the 7B wasn't that bad. There were a few hiccups, but I'd say it was somewhat better than Pygmalion at using the information given to it. The only things it seemed to struggle with were sticking to one perspective (it would mix third person and POV instead of strictly following what I was using), and it would sometimes get stuck using up the entire generation instead of finishing the sentence, but that's not a huge deal.
It works great, but I have problems when I try to get it to write code: the code is not displayed, and sometimes it even breaks the display of the chat box. It looks like a text escaping problem.
Hello, I modded the Colab and it worked great. I was trying the 30B version, but it took more than twenty minutes just to download the model, and I was using a premium GPU runtime.
I tried to !cp the files to my GDrive when I was done, but the Colab ran out of disk and I missed 10 of the shards(?). GDrive itself has space, as I have 2TB. Even if it had worked, I'm not sure I would have figured out how to load it next time.
Is there any way I can download directly to a mounted GDrive and load the model from it later, instead of redownloading each time?
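This wasn't answered in the thread, but one way to do it would be to mount Drive and point the webui's models directory at it. A minimal sketch, assuming a Drive folder called text-gen-models (a made-up name):
from google.colab import drive
drive.mount('/content/drive')

# Keep models on Drive so they persist between sessions (folder name is hypothetical)
!mkdir -p /content/drive/MyDrive/text-gen-models

# Replace the local models directory with a symlink to the Drive folder,
# so downloads land on Drive and can be loaded again next session
!rm -rf /content/text-generation-webui/models
!ln -s /content/drive/MyDrive/text-gen-models /content/text-generation-webui/models
Loading multi-gigabyte shards over the Drive mount can be slow, but it avoids redownloading the model every session.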
@NoShinSekai it should work now if you instruct the bot to use markdown.
https://github.com/oobabooga/text-generation-webui/pull/266
Hi @oobabooga,
I saw a post on Twitter: LLAMA 7B "4bit" model working on Colab
Can you help me implement LLAMA 7B "4bit" on Colab? Thank you very much!
You pretty much just need to copy and paste the commands here
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
And add another command to download the 4-bit preconverted model from decapoda.
Yes, many thanks!
The colab notebook isn't working for me - instead of printing a Gradio URL the colab cell just finishes with the following output:
python server.py --share --gptq-bits 4 --model llama-13b-hf
Loading llama-13b-hf...
Loading model ...
^C
This is the code:
!pip uninstall transformers
!pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176
%cd /content
!git clone https://github.com/oobabooga/text-generation-webui
!mkdir text-generation-webui/logs
!ln -s text-generation-webui/logs .
!ln -s text-generation-webui/characters .
!ln -s text-generation-webui/models .
%rm -r sample_data
%cd text-generation-webui
!pip install -r requirements.txt
!mkdir repositories
%cd repositories
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!pip install -r requirements.txt
!python setup_cuda.py install
%cd ../..
model = "llama-13b" #@param ["llama-13b", "llama-7b"]
model_dir = model + "-hf"
!python download-model.py --text-only decapoda-research/$model_dir
%cd models/$model_dir
!wget https://huggingface.co/decapoda-research/$model-hf-int4/resolve/main/$model-4bit.pt
%cd ../..
import json
cmd = f"python server.py --share --gptq-bits 4 --model {model_dir}"
print(cmd)
!$cmd
What's going wrong? Everything works fine if I load in 8 bit.
@oobabooga: Thanks for the tips on Markdown formatting! The new UI is even better!
@generatorman: I'm using the following code to run the 13B model in 4bit (I didn't install the LLaMA transformers branch, but it works great like this):
Step 1 : install GPTQ-for-LLaMa
%cd /content/text-generation-webui/
!mkdir repositories
%cd /content/text-generation-webui/repositories/
!git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
%cd GPTQ-for-LLaMa
!python setup_cuda.py install
Step 2 : Download the llama-13b model and launch the Web UI
%cd /content/text-generation-webui/
!python download-model.py --text-only decapoda-research/llama-13b-hf
!wget https://huggingface.co/decapoda-research/llama-13b-hf-int4/resolve/main/llama-13b-4bit.pt -P /content/models/llama-13b-hf/
!python server.py --share --model llama-13b-hf --gptq-bits 4
Doing the same thing as previously (changing the lines from the first post), I now get this error. Any idea what to do?
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2h1afqm5qj3gi --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100% 33/33 [01:07<00:00, 2.05s/it]
Traceback (most recent call last):
File "/content/text-generation-webui/server.py", line 215, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/content/text-generation-webui/modules/models.py", line 158, in load_model
tokenizer = AutoTokenizer.from_pretrained(Path(f"models/{shared.model_name}/"))
File "/usr/local/lib/python3.9/dist-packages/transformers/models/auto/tokenization_auto.py", line 677, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
https://github.com/oobabooga/text-generation-webui/commit/23a5e886e1aa6849e0819256c3bb4b2bf7d8358e
This is why.
@Enferlain You need to edit the file tokenizer_config.json in /models/llama-7b-hf/ (or whatever model you are using) and change the string "LLaMATokenizer" to "LlamaTokenizer". I got the same error a few hours ago.
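For example, a one-liner to do that rename in a Colab cell (the path assumes the llama-13b-hf folder from the cells above; adjust it for your model):
# Fix the tokenizer class name in the downloaded config
!sed -i 's/LLaMATokenizer/LlamaTokenizer/' /content/text-generation-webui/models/llama-13b-hf/tokenizer_config.json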
Thanks, that works. Oh, and I wanted to try the 13B model with the instructions you wrote, but I only have 12.7GB of RAM on Colab, which seems to bust at the loading part.
You need Colab Pro to use the high-RAM environment when loading the model; it takes something like 15~16GB of RAM.
13B in 8bit loaded fine for me without Pro and never used more than 3GB of RAM during loading.
The VRAM could not fit the full 2048 context, but it loaded and ran fine.
Repository Not Found for url: https://huggingface.co/models/llama-7b-hf/resolve/main/config.json
Did they change something 🤔
@NoShinSekai
Any idea why I get this error when I try to load 4bit models?
Loading llama-13b-hf-int4...
Could not find the quantized model in .pt or .safetensors format, exiting...
I tried 3 different ones
https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I think the original 4bit models aren't working anymore; this thread suggests grabbing them from the torrents they provide.
Also, GPTQ is needed for 4bit, and the main repo isn't working right at the moment and was replaced with oobabooga's fork: "git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda"
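In Colab terms, that means swapping the clone in the earlier GPTQ step for the fork, roughly like this (a sketch, assuming the same repositories layout used above):
%cd /content/text-generation-webui/repositories
!rm -rf GPTQ-for-LLaMa
# Use oobabooga's fork of GPTQ-for-LLaMa (cuda branch) instead of the main repo
!git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
%cd GPTQ-for-LLaMa
!python setup_cuda.py install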
I tried again and now I'm not getting that error; I must have missed something. I'm struggling to load a model in, though.
This is the setup
I tried these models so far:
- https://huggingface.co/wcde/llama-13b-4bit-gr128 - system RAM spiked up to 11-12GB and it died with ^C
- https://huggingface.co/elinas/alpaca-13b-lora-int4 - this one works pretty well
- https://huggingface.co/decapoda-research/llama-13b-hf-int4 - errors because it doesn't have a config.json
I might try uploading some other models through Drive or look around on Hugging Face for some more; thanks for the comment, since it made me try again.
On another note, Gradio is being so slow, holy shit.
Gradio 3.24.0 seems to be really unstable on Colab for some reason. You can try adding "pip install gradio==3.18.0" after "pip install -r requirements.txt".
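In the notebook's install cell, that would look roughly like this (a sketch; the exact version pin may need adjusting):
%cd text-generation-webui
!pip install -r requirements.txt
# Pin Gradio to an older release that is more stable on Colab
!pip install gradio==3.18.0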
https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb
I just ran it on Colab and it caused an error at the end:
OSError: models/pygmalion-6b_original-sharded is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'. If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token, or log in with huggingface-cli login and pass use_auth_token=True.
Maybe it's because I didn't put the model file under the models folder?
Should I then go to the link below (since I want to use the Pygmalion 6B model): https://huggingface.co/PygmalionAI/pygmalion-6b/tree/main
and download the files and put the .bin file in models? I don't know which files I should put there.
I encountered the same issue and dug into the error and the source code. Before the OSError, there was a KeyError in text-generation-webui/download-model.py. I opened the file in Colab and did the following:
- removed the while loop on line 106
- removed the if statement on line 111
- commented out the three "cursor =" lines starting from line 140
- unindented the nested code
- replaced line 106 (the "content =" line) with: content = requests.get(f"https://huggingface.co/api/models/decapoda-research/llama-7b-hf/tree/main").content
- renamed the dict variable to something else (maybe dictt)
And that's it, this should work and the LLaMA model should get downloaded in the models folder.
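For context, here is a rough, self-contained sketch of what that simplified download logic boils down to: list the repo's files through the Hugging Face tree API and download each one into the models folder. The response fields ("type", "path") and the flat-repo assumption are mine, not the exact code from download-model.py:
import os
import requests

repo = "decapoda-research/llama-7b-hf"
target_dir = f"models/{repo.split('/')[1]}"
os.makedirs(target_dir, exist_ok=True)

# List the repo's files via the Hugging Face tree API
# (assumed response shape: a JSON list of objects with "type" and "path" fields)
files = requests.get(f"https://huggingface.co/api/models/{repo}/tree/main").json()

for entry in files:
    if entry.get("type") != "file":
        continue
    name = entry["path"]
    url = f"https://huggingface.co/{repo}/resolve/main/{name}"
    print(f"Downloading {name}...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(os.path.join(target_dir, name), "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)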
Traceback (most recent call last):
File "/content/text-generation-webui/download-model.py", line 169, in
How do I configure it to run a "sharded" version of Mistral-7B so it can use a free T4 in Colab? For example: https://huggingface.co/bn22/Mistral-7B-Instruct-v0.1-sharded/tree/main
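Not answered in this thread, but following the same models dict pattern from the first post, you could try adding an entry like the one below (the display name and local folder name are my own guesses, and the notebook's transformers version would need to be new enough to support Mistral):
models = {
    # ... existing entries from the first post ...
    "Mistral-7B-Instruct (sharded)": ("bn22", "Mistral-7B-Instruct-v0.1-sharded", "main", "Mistral-7B-Instruct-v0.1-sharded"),
}
You would also add the same display name to the #@param list on Line 6 so it shows up in the dropdown.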