
7B model CUDA out of memory on RTX 3090 Ti 24GB

Jehuty-ML opened this issue on Mar 06 '23 · 23 comments

I have seen someone in this issue's comments say that the 7B model needs just 8.5GB of VRAM, so why does running example.py return out of memory on a 24GB VRAM card? Any help will be appreciated, thanks!

Jehuty-ML avatar Mar 06 '23 08:03 Jehuty-ML

@Jehuty-ML it might have to do with their recent update to the sequence length (1024 to 2048). Also, try changing the batch size to 2 and reducing the example prompts to an array of size two in example.py.

Just to test things out, try a previous commit to restore the sequence length. If you are interested in distributing the model on multiple machines, try https://github.com/modular-ml/wrapyfi-examples_llama

fabawi avatar Mar 06 '23 08:03 fabawi
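For reference, a minimal sketch of the edits being suggested above, assuming the example.py layout at the time (a hard-coded prompts list and ModelArgs built from params.json); exact names and locations vary by commit, and the checkpoint path below is a placeholder:

```python
import json
from pathlib import Path

from llama import ModelArgs  # from the facebookresearch/llama reference code

# Trim the hard-coded prompt list in example.py down to two entries.
prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that",
]

# In the loader, shrink the preallocated buffers accordingly.
params = json.loads((Path("path/to/7B") / "params.json").read_text())  # placeholder path
model_args = ModelArgs(
    max_seq_len=2048,   # or check out an older commit to get 1024 back
    max_batch_size=2,   # must cover len(prompts); smaller means smaller caches
    **params,
)
```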

Use this branch: https://github.com/tloen/llama-int8.git

I'm able to run it on my 1080 Ti with 2048 input and 2048 output tokens; with 24GB you might even be able to run the 13B model instead.

(If the calculation blows up, you may have to change or add the threshold value of the Linear8bit layer, to 2.0 for example.)

archytasos avatar Mar 06 '23 12:03 archytasos
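For context, here is a hedged sketch of the kind of layer swap that int8 branch performs, using bitsandbytes' Linear8bitLt and its threshold argument. The helper below is illustrative, not the fork's actual code:

```python
import torch
import bitsandbytes as bnb

def to_int8(linear: torch.nn.Linear, threshold: float = 2.0) -> bnb.nn.Linear8bitLt:
    """Illustrative helper: replace an fp16 nn.Linear with an 8-bit layer."""
    int8_layer = bnb.nn.Linear8bitLt(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        has_fp16_weights=False,   # store the weights as int8
        threshold=threshold,      # activation outliers above this stay in fp16
    )
    int8_layer.load_state_dict(linear.state_dict())
    return int8_layer.cuda()      # quantization happens when moved to the GPU
```

Lowering the threshold from the bitsandbytes default of 6.0 treats more activation outliers as fp16, which tends to stop the blow-ups mentioned above at the cost of a little extra memory and speed.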

24GB of VRAM is more than enough for the 7B model; I ran it with just 12GB of RAM and 16GB of VRAM, see #105. You want to set the batch size to 1. (To clarify, the 7B model needs about 14GB of VRAM because each weight takes 2 bytes.)

As for the int8 version, it may take less VRAM, but (correct me if I'm wrong) it takes a great deal of RAM to do the conversion, and after all that it is probably slower. I will try the int8 version later to see what all the fuss is about 😀

elephantpanda avatar Mar 06 '23 15:03 elephantpanda
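A quick back-of-the-envelope check of that 14GB figure:

```python
# fp16/bf16 stores each of the ~7 billion weights in 2 bytes.
params = 7e9
print(f"{params * 2 / 1e9:.0f} GB")       # ~14 GB (decimal units)
print(f"{params * 2 / 1024**3:.1f} GiB")  # ~13.0 GiB in binary units, weights only
```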

BTW, people are now trying 3- and 4-bit weights. That might be even more promising:

https://github.com/huggingface/transformers/pull/21955#issuecomment-1456684556

archytasos avatar Mar 06 '23 19:03 archytasos
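For intuition on why lower bit-widths are promising: two 4-bit codes fit in one byte, so 7B parameters at 4 bits is roughly 3.5GB of weights before any overhead. A small illustrative snippet (not code from the linked PR) showing the packing:

```python
import torch

# Two 4-bit quantization codes (values 0..15) share a single byte.
codes = torch.randint(0, 16, (8,), dtype=torch.uint8)
packed = (codes[0::2] << 4) | codes[1::2]          # 8 codes -> 4 bytes

# Unpacking recovers the original codes exactly.
unpacked = torch.stack([(packed >> 4) & 0xF, packed & 0xF], dim=1).flatten()
assert torch.equal(unpacked, codes)
print(f"{codes.numel()} codes stored in {packed.numel()} bytes")
```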

Adjusting your batch size and reducing the number of prompts will definitely help; it worked for me, and I'm on 4x V100 GPUs. One thing I couldn't wrap my head around is the UnicodeEncodeError I keep seeing while trying to run example.py.

@Jehuty-ML it might have to do with their recent update to the sequence length (1024 to 2048). Also, try changing the batch size to 2 and reducing the example prompts to an array of size two in example.py.

Just to test things out, try a previous commit to restore the sequence length. If you are interested in distributing the model on multiple machines, try https://github.com/modular-ml/wrapyfi-examples_llama

AIWithShrey avatar Mar 06 '23 19:03 AIWithShrey

Has anyone tried with an RTX 4090, which is 24GB? Can we fit the 13B model on the RTX 4090?

vencolini avatar Mar 07 '23 08:03 vencolini

I have seen someone in this issue's comments say that the 7B model needs just 8.5GB of VRAM, so why does running example.py return out of memory on a 24GB VRAM card? Any help will be appreciated, thanks!

It is in the FAQ. Change max_seq_len to 512 and it should work with defaults on a 24GB card.

neuhaus avatar Mar 07 '23 20:03 neuhaus
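To see why max_seq_len is the knob that matters: the reference implementation preallocates fp16 K and V caches of shape (max_batch_size, max_seq_len, n_heads, head_dim) in every layer, so with defaults along the lines of batch 32 and sequence 2048 the cache alone overflows a 24GB card. A rough estimate, assuming the 7B architecture (32 layers, 32 heads, head dimension 128); exact behaviour depends on your commit:

```python
def kv_cache_gib(batch: int, seq: int, layers: int = 32, heads: int = 32,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Size of the preallocated K+V caches in GiB (fp16, across all layers)."""
    return 2 * layers * batch * seq * heads * head_dim * bytes_per_value / 1024**3

print(kv_cache_gib(32, 2048))  # ~32 GiB -> OOM before the ~13 GiB of weights even matter
print(kv_cache_gib(32, 512))   # ~8 GiB  -> fits on a 24GB card next to the weights
print(kv_cache_gib(1, 2048))   # ~1 GiB  -> dropping the batch size works too
```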

Man, I have an 8GB NVIDIA RTX 2080 and can't run it locally, even the 7B model. My options are basically to try to get a cloud VM, I suppose. Sigh.

chrismattmann avatar Mar 08 '23 18:03 chrismattmann

What's crazy is that I just went through all the hoops to get a working Python version with torch (I had 3.6 before and had to move up to 3.9 to get it all running and working), and then boom: run it, OOM. Then I found this great thread on Google, reduced the batch size to 1 and the prompts down to one phrase, and it still doesn't run. Arrrgh.

chrismattmann avatar Mar 08 '23 18:03 chrismattmann

@Jehuty-ML try https://github.com/facebookresearch/llama/issues/166

I also run it smoothly on a card with 24GB of VRAM.

soulteary avatar Mar 09 '23 09:03 soulteary

Update: I was able to run the CPU version on my Mac. It took 35 seconds to load the model and 30+ minutes to run the example, but at least I was able to get it to run!

chrismattmann avatar Mar 09 '23 15:03 chrismattmann

@chrismattmann Why not try out the new 4bit fork? https://github.com/qwopqwop200/GPTQ-for-LLaMa

archytasos avatar Mar 09 '23 21:03 archytasos

@chrismattmann Why not try out the new 4bit fork? https://github.com/qwopqwop200/GPTQ-for-LLaMa

Thank you, I will give this a shot. I have 32GB of RAM on my Kubuntu Focus Gen 2, with an NVIDIA RTX 2080 with 8GB of VRAM in it, so I guess I should be able to use this with the 7B parameter model, right? The exact steps are unclear to me from reading the README, but I'll look at it in more detail soon. Thank you @archytasos!

chrismattmann avatar Mar 09 '23 22:03 chrismattmann

The exact steps are unclear to me from reading the README

I was able to get the 4-bit version kind of working on an 8GB 2060 SUPER (it still OOMs occasionally, shrug, but it mostly works), but you're right, the steps are quite unclear. I had to use a specific CUDA version (11.7) and compile everything myself, especially the CUDA kernel, with python setup_cuda.py build followed by python setup_cuda.py install.

brandonrobertz avatar Mar 09 '23 23:03 brandonrobertz

The exact steps are unclear to me from reading the README

I was able to get the 4-bit version kind of working on an 8GB 2060 SUPER (it still OOMs occasionally, shrug, but it mostly works), but you're right, the steps are quite unclear. I had to use a specific CUDA version (11.7) and compile everything myself, especially the CUDA kernel, with python setup_cuda.py build followed by python setup_cuda.py install.

Would you be willing to post a Gist or provide the steps you ran beyond those 2 Python commands? Should I run it on a CPU only machine or do I need a GPU?

chrismattmann avatar Mar 10 '23 00:03 chrismattmann

Would you be willing to post a Gist or provide the steps you ran beyond those 2 Python commands? Should I run it on a CPU only machine or do I need a GPU?

Yeah I'll gist up all my steps + my exact setup. I think you need a GPU, but I'm not 100% sure. Maybe someone else can confirm or deny?

brandonrobertz avatar Mar 10 '23 00:03 brandonrobertz

Here are all my steps. I'm using Ubuntu 22.04, Python 3.10, and CUDA 11.7, and everything is working on my NVIDIA RTX 2060 SUPER 8GB card with the 4-bit 7B model.

https://gist.github.com/brandonrobertz/972d943d5442a06aa983ac09cb424a39

I'm not sure if you can install all the dependencies without a GPU, though. Supposedly setting the CUDA_VISIBLE_DEVICES=-1 environment variable can make CUDA run on the CPU.

brandonrobertz avatar Mar 10 '23 01:03 brandonrobertz
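One small clarification on that last point: CUDA_VISIBLE_DEVICES=-1 doesn't make CUDA itself run on the CPU; it hides the GPUs, so a program only falls back to the CPU if it checks for availability and has a CPU code path. A quick way to see the effect:

```python
import os

# Hide every GPU; set this before CUDA is initialized (in practice, before
# importing torch), otherwise it is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch

print(torch.cuda.is_available())                         # False
device = "cuda" if torch.cuda.is_available() else "cpu"  # typical fallback pattern
print(device)                                             # "cpu"
```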

Here are all my steps. I'm using Ubuntu 22.04, Python 3.10, and CUDA 11.7, and everything is working on my NVIDIA RTX 2060 SUPER 8GB card with the 4-bit 7B model.

https://gist.github.com/brandonrobertz/972d943d5442a06aa983ac09cb424a39

I'm not sure if you can install all the dependencies without a GPU, though. Supposedly setting the CUDA_VISIBLE_DEVICES=-1 environment variable can make CUDA run on the CPU.

After you've saved the model, how do you load it?

elephantpanda avatar Mar 10 '23 09:03 elephantpanda

After you've saved the model, how do you load it?

By using the --load argument (with the same value as --save). The script they provide only works on the three datasets hard-coded in it. I'm working on getting it to do interactive prompts right now.

brandonrobertz avatar Mar 10 '23 19:03 brandonrobertz

These instructions work for systems WITHOUT a GPU: https://til.simonwillison.net/llms/llama-7b-m2 (it uses this library: https://github.com/ggerganov/llama.cpp).

brandonrobertz avatar Mar 11 '23 19:03 brandonrobertz

My pull request (https://github.com/facebookresearch/llama/pull/202) simply made it runnable on CPU with a flag, based on the existing python script. Check it out.

https://github.com/ggerganov/llama.cpp is great, despite losing some accuracy to quantization or fp16. It's way faster if that loss can be tolerated.

lintian06 avatar Mar 17 '23 16:03 lintian06

You can try this simple hacked LLaMA that consumes 14GB of VRAM in fp16: https://github.com/Tongjilibo/bert4torch/blob/master/examples/basic/basic_language_model_llama.py

Tongjilibo avatar Mar 17 '23 16:03 Tongjilibo

This issue is resolved and should be closed.

neuhaus avatar Apr 11 '23 12:04 neuhaus