7B model CUDA out of memory on RTX 3090 Ti 24GB
I have seen someone in this issue's message area say that the 7B model needs just 8.5GB of VRAM, so why does running example.py return out of memory on a 24GB VRAM card? Any help will be appreciated! Thanks!
@Jehuty-ML It might have to do with their recent update to the sequence length (1024 to 2048). Also, try changing the batch size to 2 and reducing the example prompts to an array of size two in example.py (see the sketch below).
Just to test things out, try a previous commit to restore the sequence length. If you are interested in distributing the model across multiple machines, try https://github.com/modular-ml/wrapyfi-examples_llama
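To make that concrete, here is a rough sketch of the kind of edit being suggested, assuming the stock example.py layout (a main() that calls setup_model_parallel() and load(), then generates from a list of prompts); exact names and defaults may differ depending on which commit you have checked out:

# Hypothetical, locally modified main() from example.py -- not the upstream file.
def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_seq_len: int = 1024,  # roll back from 2048 to the older, smaller value
    max_batch_size: int = 2,  # a smaller batch shrinks the cache the model allocates
):
    # setup_model_parallel() and load() are the helpers already defined in example.py
    local_rank, world_size = setup_model_parallel()
    generator = load(
        ckpt_dir, tokenizer_path, local_rank, world_size, max_seq_len, max_batch_size
    )
    # Trim the example prompts down to two so they fit the smaller batch.
    prompts = [
        "I believe the meaning of life is",
        "Simply put, the theory of relativity states that",
    ]
    results = generator.generate(
        prompts, max_gen_len=256, temperature=temperature, top_p=top_p
    )
    for result in results:
        print(result)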
Use this branch: https://github.com/tloen/llama-int8.git
I'm able to run it on my 1080 Ti with 2048 tokens in and 2048 tokens out; you might even be able to run the 13B model instead.
(If the calculation blows up, you might have to change/add the threshold value of the Linear8bit layer, to 2.0 for example.)
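For context, a minimal sketch of what such a threshold change could look like, assuming the fork swaps nn.Linear for bitsandbytes' Linear8bitLt (the actual layer names and plumbing in the fork may differ):

import torch
import bitsandbytes as bnb

# Hypothetical stand-in for one of the model's linear layers, not the fork's actual code.
# threshold is the outlier cutoff from LLM.int8(): activation features above it are
# handled in fp16. Lowering it (e.g. to 2.0) routes more of the computation through
# the fp16 path, which can help when the int8 matmul blows up numerically.
layer = bnb.nn.Linear8bitLt(
    4096, 4096,
    bias=False,
    has_fp16_weights=False,  # store weights in int8
    threshold=2.0,
)
layer = layer.to("cuda")  # weights are quantized when moved to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = layer(x)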
24GB of VRAM is more than enough for the 7B model. I ran it with just 12GB RAM and 16GB VRAM, see #105. You want to set the batch size to 1. (To clarify, the 7B model needs about 14GB of VRAM, because each fp16 weight takes 2 bytes.)
The int8 version may take less VRAM, but (correct me if I'm wrong) it takes a great deal of system RAM to do the conversion, and after all that it is probably slower. I will try the int8 version later to see what all the fuss is about 😀
BTW, people are now trying 3/4-bit weights. That might be even more promising:
https://github.com/huggingface/transformers/pull/21955#issuecomment-1456684556
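To put numbers on that, a quick back-of-the-envelope sketch counting only weight storage (activations and the KV cache come on top of this):

# Approximate VRAM needed just for the 7B weights at different precisions.
n_params = 7e9
for name, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{n_params * bytes_per_weight / 1e9:.1f} GB")
# fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB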
Adjusting your batch size and reducing the number of prompts will definitely help. It worked for me, and I'm on 4x V100 GPUs. One thing I couldn't wrap my head around is the UnicodeEncodeError I keep seeing while trying to run example.py.
Has anyone tried with an RTX 4090, which is 24GB? Can we fit the 13B model on the RTX 4090?
It is in the FAQ. Change max_seq_len to 512 and it should work with defaults on a 24GB card.
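For reference, a sketch of where that number ends up, assuming the stock example.py, whose load() builds a ModelArgs from these values (the KV cache the model allocates scales with max_seq_len * max_batch_size, which is why lowering both helps):

import json
from pathlib import Path
from llama import ModelArgs  # as exported by the stock repo

ckpt_dir = "downloads/7B"  # hypothetical path to your checkpoint folder
params = json.loads((Path(ckpt_dir) / "params.json").read_text())

model_args = ModelArgs(
    max_seq_len=512,     # the FAQ value; caps the per-layer KV cache
    max_batch_size=1,    # one prompt at a time
    **params,            # dim, n_layers, n_heads, ... from the checkpoint
)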
Man, I have an 8GB NVIDIA RTX 2080 and can't run it locally, even the 7B model. Options are basically to try and get a cloud VM, I suppose. Sigh.
What's crazy is that I just went through all the hoops to get a working Python version with torch (I had 3.6 before and had to go up to 3.9 to get it all running and working), and then boom, I run it: OOM. Then I found this great thread on Google, reduced the batch size to 1, then down to 1 phrase, and it still doesn't run. Arrrgh
@Jehuty-ML try https://github.com/facebookresearch/llama/issues/166
I also run it smoothly on a card with 24GB of VRAM.
Update: I was able to run the CPU version on my Mac. It took 35 seconds to load the model and 30+ minutes to run the example, but at least I was able to get it to run!
@chrismattmann Why not try out the new 4bit fork? https://github.com/qwopqwop200/GPTQ-for-LLaMa
Thank you, I will give this a shot. I have 32GB of RAM on my Kubuntu Focus Gen 2, with an NVIDIA RTX 2080 with 8GB of VRAM in it, so I should be able to use this on the 7B parameter model, right? It's unclear to me what the exact steps are from reading the README, but I'll look at it in more detail soon. Thank you @archytasos !
It's unclear to me what the exact steps are from reading the README
I was able to get the 4-bit version kind of working on an 8GB 2060 SUPER (still OOM occasionally, shrug, but it mostly works), but you're right, the steps are quite unclear. I had to use a specific CUDA version (11.7) and compile everything myself, especially the CUDA kernel, with:
python setup_cuda.py build
python setup_cuda.py install
Would you be willing to post a Gist or provide the steps you ran beyond those 2 Python commands? Should I run it on a CPU only machine or do I need a GPU?
Yeah I'll gist up all my steps + my exact setup. I think you need a GPU, but I'm not 100% sure. Maybe someone else can confirm or deny?
Here are all my steps. I'm using Ubuntu 22.04, Python 3.10, and CUDA 11.7, and everything is working on my NVIDIA RTX 2060 SUPER 8GB card with the 4-bit 7B model.
https://gist.github.com/brandonrobertz/972d943d5442a06aa983ac09cb424a39
I'm not sure if you can install all the deps without a GPU, though. Supposedly setting the CUDA_VISIBLE_DEVICES=-1 environment variable hides the GPU so everything falls back to the CPU.
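For what it's worth, a quick way to check that fallback (the variable has to be set before torch touches CUDA):

import os

# Hide every GPU from CUDA before importing torch, so nothing can be placed on one.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch

print(torch.cuda.is_available())  # False -> code paths have to fall back to the CPU
x = torch.randn(2, 2)             # tensors default to the CPU
print(x.device)                   # cpu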
After you saved the model, how do you load it?
By using the --load argument (with the same value you passed to --save). The script they have only works on the three datasets baked into it. I'm working on getting it to do interactive prompts right now.
These instructions work for systems WITHOUT a GPU: https://til.simonwillison.net/llms/llama-7b-m2 (it uses this lib: https://github.com/ggerganov/llama.cpp)
My pull request (https://github.com/facebookresearch/llama/pull/202) simply made it runnable on CPU with a flag, based on the existing python script. Check it out.
https://github.com/ggerganov/llama.cpp is great, despite losing some accuracy to quantization or fp16. It's way faster if that loss can be tolerated.
You can try this simple hacked LLaMA, which consumes 14GB of VRAM in fp16: https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_llama.py
This issue is resolved and should be closed.