
Will it run on an RTX 3080 with 16GB VRAM?

pauldog opened this issue 1 year ago · 14 comments

  • Will it run on an RTX 3080 with 16GB of VRAM?
  • Will the trained model be available to download?
  • Will there be an API for this, and how much will it cost?

(I doubt it will be small enough to run on 8GB but that would be ideal if it could be compressed enough)

Thanks 😁

pauldog avatar Feb 25 '23 09:02 pauldog

No chance

hollykbuck avatar Feb 25 '23 18:02 hollykbuck

I heard there is a 7B-weight model. The float16 version would be about 14GB, so I guess that is out of reach for the 3080 (maybe it would run on dual graphics cards?). I can run Stable Diffusion, but that is only 2-3GB. Perhaps some researcher with access to the 14GB model will find a way to compress or quantize it down to 3GB; you never know. I'm sure 90% of it is non-English text, so if you wanted a purely English-language model you could probably delete 90% of the weights and save it in a sparse data format. The problem is that once you put it on the graphics card, it has to be expanded into a non-sparse form.
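
For a rough sense of the sizes being discussed, here is a back-of-the-envelope sketch in Python (weights only; activations and other runtime buffers add more on top):

# Approximate weight memory at common precisions (weights only, ignoring overhead).
params = {"7B": 7e9, "13B": 13e9}
bytes_per_param = {"float16": 2, "int8": 1, "int4": 0.5}

for name, n in params.items():
    for dtype, b in bytes_per_param.items():
        print(f"{name} @ {dtype}: {n * b / 1e9:.1f} GB")

# Prints roughly: 7B @ float16 = 14.0 GB, 7B @ int8 = 7.0 GB, 7B @ int4 = 3.5 GB, etc.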

pauldog avatar Feb 25 '23 18:02 pauldog

I believe with CPU offloading, it should be possible.

A related guide: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
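
A minimal sketch of the approach that guide describes, assuming a Transformers-compatible checkpoint directory (the path below is a placeholder):

# pip install accelerate transformers
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "/path/to/hf-format-checkpoint"  # placeholder

config = AutoConfig.from_pretrained(checkpoint_dir)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)  # no real weights allocated yet

model = load_checkpoint_and_dispatch(
    model,
    checkpoint_dir,
    device_map="auto",          # fill the GPU first, overflow layers go to CPU
    offload_folder="offload",   # spill to disk if CPU RAM is also tight
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

Layers that don't fit in VRAM run from CPU RAM (or disk), which is what keeps memory manageable but also what makes it slow.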

sayakpaul avatar Feb 25 '23 19:02 sayakpaul

Or maybe LLM.int8() with https://github.com/TimDettmers/bitsandbytes/
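
For reference, a hedged sketch of what 8-bit loading looks like through the Transformers + bitsandbytes integration (the checkpoint path is a placeholder; LLaMA itself was not yet supported in Transformers at the time of this thread):

# pip install bitsandbytes accelerate transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/hf-format-checkpoint"  # placeholder

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",   # let Accelerate place layers on GPU/CPU
    load_in_8bit=True,   # LLM.int8(): ~1 byte per weight instead of 2 in float16
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)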

archytasos avatar Feb 25 '23 22:02 archytasos

I believe with CPU offloading, it should be possible.

A related guide: https://huggingface.co/docs/accelerate/usage_guides/big_modeling

Interesting... this might be useful for other models I'm running.

I have a feeling it will be pretty slow running on a personal GPU, but it might be useful just for research purposes.

pauldog avatar Feb 26 '23 02:02 pauldog

FlexGen 4-bit is already here, so I think you can run a 20B model locally. Let's wait and see...

ye7iaserag avatar Feb 28 '23 20:02 ye7iaserag

FlexGen 4-bit is already here, so I think you can run a 20B model locally. Let's wait and see...

I've no doubt it can be run locally; the question is speed. For example, I could put 64GB of RAM in my desktop and run it on the CPU, but it would be very slow!

With things like

  • quantization
  • memory offloading
  • breaking model into parts

etc., we should be able to make it run pretty fast on a moderate GPU (a manual split across devices is sketched below). And the more people who have access to the model, the more people can work out how to optimize it further.
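
To make "breaking the model into parts" concrete, here is a hypothetical manual device map in the Accelerate style; the module names are illustrative and may not match a given LLaMA implementation:

# Hypothetical split: first 16 transformer blocks on GPU 0, the rest offloaded to CPU.
device_map = {"model.embed_tokens": 0}
for i in range(32):  # a 7B-scale model has 32 transformer blocks
    device_map[f"model.layers.{i}"] = 0 if i < 16 else "cpu"
device_map["model.norm"] = "cpu"
device_map["lm_head"] = "cpu"
# A dict like this can be passed as device_map= (instead of "auto") to
# load_checkpoint_and_dispatch or from_pretrained.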

Imagine having a "brain" running on your own PC. Pretty exciting! 😁

pauldog avatar Feb 28 '23 21:02 pauldog

Can anyone instruct me on how to change the numeric type some people are mentioning? The 7B model does not run on an A100 with 32GB of RAM.

vincenzoml avatar Mar 02 '23 10:03 vincenzoml
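
For what it's worth, changing the numeric type in plain PyTorch usually just means casting the module; a generic sketch on a stand-in layer, not specific to this repository's loading code:

import torch.nn as nn

model = nn.Linear(4096, 4096)   # stand-in for a real model
print(model.weight.dtype)       # torch.float32
model = model.half()            # cast parameters to float16, roughly halving memory
print(model.weight.dtype)       # torch.float16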

I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a readme with the instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

Currently distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!

How to?

  1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts

  2. Download LLaMA weights using the official form below and install this wrapyfi-examples_llama inside conda or virtual env:

git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
  3. Install Wrapyfi with the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone 
python zeromq_proxy_broker.py --comm_type pubsubpoll
  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
  6. Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
  7. You will now see the output on both terminals.

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP, and provide the env variables for steps 5 and 6, e.g.:

### (replace 10.0.0.101 with <YOUR_IP>) ###

# step 4 modification 
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

fabawi avatar Mar 03 '23 23:03 fabawi

No chance

You can do it with Wrapyfi

(See the LLaMA with Wrapyfi instructions in the previous comment.)

fabawi avatar Mar 03 '23 23:03 fabawi

I noticed that someone has created an int8 fork of the repository at https://github.com/tloen/llama-int8.git. If I understand correctly, running the 7B model using this fork should be relatively easy on a 3080 GPU, although the 13B model may be more challenging.

archytasos avatar Mar 04 '23 21:03 archytasos

No chance

Well, it did: #105 (12GB RAM and 16GB VRAM). 😎 Lesson: don't listen to doubters. 😁

pauldog avatar Mar 05 '23 09:03 pauldog

I recommend getting at least 20GB of VRAM.

Ludobico avatar Mar 15 '23 01:03 Ludobico

@pauldog I have the 12GB 3080... would it run?

jordan-barrett-jm avatar Apr 24 '23 13:04 jordan-barrett-jm

Hi @pauldog, it seems you were able to make it run already, closing this one for now.

albertodepaola avatar Sep 06 '23 16:09 albertodepaola