
What is the lowest config that can run it?


hamzaaouni (Feb 28 '23)

These are just my educated guesses:

The lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU, if the model is exported as float16. But that would be extremely slow! Probably 30 seconds per character just running on the CPU.

It probably won't work "straight out of the box" on any consumer gaming GPU, even an RTX 3090, due to the small amount of VRAM on those cards. But people are working on techniques to share the workload between RAM and VRAM. Even then, it won't be very fast: worst case several seconds per character.

To get speeds comparable to what you see with ChatGPT, you will probably need a specialised GPU with tensor cores that implements uint8 optimisations.

But all of this depends on what you are using for inference (Torch, TensorFlow, ONNX Runtime, etc.) and how efficiently it is implemented. Each of these is getting better at using less memory.

Update: I have got the 7B model to work with 12GB system RAM and a 16GB GPU (Quadro P5000) on a Shadow PC, so better than my first guess! 😁 (Although I would recommend 16GB system RAM to load the model faster.)
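
As a rough sanity check on those figures (just back-of-the-envelope arithmetic, nothing measured): the weights alone of a 7B-parameter model come to roughly 14GB in float16 and 7GB in int8, before counting activations and the KV cache. In Python:

# Back-of-the-envelope weight-memory estimate for a 7B-parameter model.
# Ignores activations, the KV cache and framework overhead.
PARAMS = 7e9

for precision, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gigabytes:.0f} GB of weights")

# Prints roughly: float32 ~28 GB, float16 ~14 GB, int8 ~7 GB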

elephantpanda (Feb 28 '23)

> These are just my educated guesses: The lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU, if the model is exported as float16. [...]

Hi pauldog, where did you obtain this assessment information? I also need to make an assessment now and would appreciate a reference.

yanniszhou (Mar 01 '23)

Check the FlexGen numbers to get a rough idea.

neuhaus (Mar 01 '23)

You can distribute the model across two machines or GPUs and transmit the activations over ZeroMQ. Follow these instructions:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16GB VRAM.

It currently distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing of the 13B/30B models is coming soon! UPDATE: tested on two 3080 Tis as well!

How to?

  1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts

  2. Download the LLaMA weights using the official form and install this wrapyfi-examples_llama repo inside a conda or virtual env:

git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
  3. Install Wrapyfi within the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone 
python zeromq_proxy_broker.py --comm_type pubsubpoll
  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important: don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
  6. Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
  7. You will now see the output on both terminals

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP and provide the env variables for steps 5 and 6, e.g.:

### (replace 10.0.0.101 with <YOUR_IP>) ###

# step 4 modification 
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

fabawi (Mar 03 '23)

> The lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU, if the model is exported as float16. But that would be extremely slow! Probably 30 seconds per character just running on the CPU.

With 32GiB of RAM, a modern gaming CPU can infer multiple words per second with the 7B model. Just change a few lines in the code and it works: https://github.com/markasoftware/llama-cpu
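
To give a flavour of what such a CPU port involves (this is an assumed, simplified illustration in Python, not the actual diff from that repo): the reference code defaults new tensors to CUDA half precision and moves the model to the GPU; a CPU port keeps everything in float32 on the host instead.

# Simplified illustration of the kind of change a CPU port makes.
# Not the actual markasoftware/llama-cpu diff.
import torch
import torch.nn as nn

# The GPU reference code does something along the lines of:
#   torch.set_default_tensor_type(torch.cuda.HalfTensor)
#   model = Transformer(model_args).cuda()
# On CPU, keep plain float32 tensors on the host instead:
torch.set_default_tensor_type(torch.FloatTensor)

model = nn.Linear(4096, 4096)       # stand-in for a transformer layer
tokens = torch.randn(1, 4096)       # stand-in for embedded input tokens

with torch.inference_mode():
    out = model(tokens)             # runs entirely on the CPU
print(out.shape)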

markasoftware (Mar 04 '23)

> With 32GiB of RAM, a modern gaming CPU can infer multiple words per second with the 7B model. Just change a few lines in the code and it works: markasoftware/llama-cpu

Thanks for your input. I was able to generate about 32 tokens in ~3 minutes on a laptop Ryzen 5800H APU.

iamwavecut (Mar 04 '23)

> With 32GiB of RAM, a modern gaming CPU can infer multiple words per second with the 7B model. Just change a few lines in the code and it works: https://github.com/markasoftware/llama-cpu

I stand corrected 😀.

elephantpanda (Mar 04 '23)

> Thanks for your input. I was able to generate about 32 tokens in ~3 minutes on a laptop Ryzen 5800H APU.

Did you have enough RAM so it did not have to swap while generating the output?

neuhaus (Mar 08 '23)

> Did you have enough RAM so it did not have to swap while generating the output?

Yes, I have 64GB RAM, and llama-cpu peaked at about 34GB.
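
Those numbers line up with a quick back-of-the-envelope check (my arithmetic, not measured on that machine): 7B parameters in float32 is about 28GB of weights, so a ~34GB peak is plausible once activations and overhead are added, and 32 tokens in roughly 180 seconds is about 0.18 tokens per second.

# Quick check of the numbers reported above (assumes float32 weights on the
# CPU and 32 tokens generated in ~180 seconds).
params = 7e9
weight_gb = params * 4 / 1e9                              # 4 bytes per float32 parameter
print(f"float32 weights: ~{weight_gb:.0f} GB")            # ~28 GB vs ~34 GB peak observed

tokens, seconds = 32, 180
print(f"throughput: ~{tokens / seconds:.2f} tokens/s")    # ~0.18 tokens/s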

iamwavecut (Mar 13 '23)

@HamidShojanazeri - is this something we can add as guidance to the llama-recipes repo? This is probably a moving target though, so it might be difficult to maintain ground truth.

jspisak (Sep 06 '23)

@hamzaaouni - would you mind filing an issue on the https://github.com/facebookresearch/llama-recipes/ repo? I think it makes more sense to use that repo for this.

jspisak (Sep 06 '23)