The lowest config that is able to run it?
These are just my educated guesses:
The lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU, if the model is exported as float16. But that would be extremely slow! Probably 30 seconds per character just running on the CPU.
It probably won't work "straight out of the box" on any consumer gaming GPU, even an RTX 3090, due to the limited VRAM on these cards. But people are working on techniques to share the workload between RAM and VRAM. Even then it won't be very fast, worst case several seconds per character.
To get speeds comparable to what you see with ChatGPT, you will probably need a specialised GPU with tensor cores that implements uint8 optimisations.
But all of this depends on what you are using for inference (PyTorch, TensorFlow, ONNX Runtime, etc.) and how efficiently it is implemented. Each of these is getting better at using less memory.
Update: I have got the 7B model to work with 12GB system RAM and a 16GB GPU (Quadro P5000) on a Shadow PC. So better than my first guess! 😁 (Although I would recommend 16GB system RAM to load the model faster.)
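For a rough sense of where these numbers come from, here is a back-of-the-envelope estimate of the raw weight memory (my own sketch; real usage adds activations, the KV cache and framework overhead on top):

```python
# Back-of-the-envelope memory needed just to hold the weights of a 7B model.
# Rough sketch only; activations, KV cache and framework overhead come on top.
params = 7e9

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{dtype}: ~{gib:.1f} GiB for the weights alone")

# float32: ~26.1 GiB, float16: ~13.0 GiB, int8: ~6.5 GiB
# which is why a float16 7B model squeezes onto a 16GB card but not a 10-12GB one.
```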
> These are just my educated guesses: the lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU [...]
Hi pauldog, where did you get these estimates from? I need to make a similar assessment now and would appreciate a reference.
Check the FlexGen numbers to get a rough idea.
You can distribute the model on two machines or GPUs and transmit the activations over ZeroMQ. Follow these instructions:
LLaMA with Wrapyfi
Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16GB VRAM.
It currently distributes over two cards only, using ZeroMQ. Flexible distribution will be supported soon!
This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. 13B/30B models will be tested soon! UPDATE: Tested on two 3080 Tis as well!
How to?
- Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts
- Download the LLaMA weights using the official form and install this wrapyfi-examples_llama inside a conda or virtual env:
git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
- Install Wrapyfi with the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
- Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
- Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important: don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
- Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
- You will now see the output on both terminals.
- EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker with that IP set and provide the environment variable for steps 5 and 6, e.g.:
### (replace 10.0.0.101 with <YOUR_IP>) ###
# step 4 modification
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
> The lowest config to run a 7B model would probably be a laptop with 32GB RAM and no GPU, if the model is exported as float16. But that would be extremely slow! Probably 30 seconds per character just running on the CPU.
Running with 32GiB of RAM on a modern gaming CPU, the 7B model can infer multiple words per second. Just change a few lines in the code and it works: https://github.com/markasoftware/llama-cpu
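For anyone wondering what "a few lines" amounts to: the stock example script sets everything up for CUDA in fp16, and a CPU port swaps that for CPU-friendly defaults. A rough, hypothetical sketch of the kind of change (not the exact diff in that fork):

```python
# Hypothetical sketch of the kind of CPU-only change involved
# (not the exact llama-cpu diff).
import torch

# The reference example initialises for CUDA fp16, roughly:
#   torch.set_default_tensor_type(torch.cuda.HalfTensor)   # params created on the GPU in fp16
#
# A CPU port keeps everything on the CPU in a dtype the CPU handles well:
torch.set_default_tensor_type(torch.FloatTensor)  # fp32 (bfloat16 also works on recent CPUs)
# model = Transformer(model_args)                 # weights now stay in system RAM
```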
> Running with 32GiB of RAM on a modern gaming CPU, the 7B model can infer multiple words per second. Just change a few lines in the code and it works: markasoftware/llama-cpu
Thanks for your input. I was able to generate about 32 tokens in ~3 minutes on a laptop with a Ryzen 5800H APU.
> Running with 32GiB of RAM on a modern gaming CPU, the 7B model can infer multiple words per second. Just change a few lines in the code and it works: https://github.com/markasoftware/llama-cpu
I stand corrected 😀.
> Thanks for your input. I was able to generate about 32 tokens in ~3 minutes on a laptop with a Ryzen 5800H APU.
Did you have enough RAM so it did not have to swap while generating the output?
> Did you have enough RAM so it did not have to swap while generating the output?
Yes, I have 64GB RAM, and llama-cpu peaked at about 34GB.
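If anyone else wants to check the same thing, one simple way is to record the process's peak resident memory after generation finishes. A small sketch (assuming Linux, where ru_maxrss is reported in KiB):

```python
# Report the peak resident memory of this process after a generation run.
# Sketch assumes Linux, where ru_maxrss is in KiB (on macOS it is in bytes).
import resource

# ... run your generation here, e.g. the example.py inference loop ...

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024**2:.1f} GiB")
# If this stays comfortably below physical RAM, the run did not need to swap.
```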
@HamidShojanazeri - is this something we can add as guidance to the llama-recipes repo? This is probably a moving target, though, so it might be difficult to maintain ground truth.
@hamzaaouni - would you mind filing an issue on the https://github.com/facebookresearch/llama-recipes/ repo? I think it makes more sense to use that repo for this.