Attempting to run the 7B model on two Nvidia 3090s, but getting an OOM error with one GPU and can't use both
Hello all,
I'm trying to use the 7B model on a machine with two Nvidia 3090s, but I'm running out of VRAM.
$ torchrun --nproc_per_node 1 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
leads to
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 24.00 GiB total capacity; 23.17 GiB already allocated; 0 bytes free; 23.17 GiB reserved in total by PyTorch)
I have two 3090s, so I was hoping to use the full 48 GB of VRAM; however, the model doesn't want to run on more than one GPU, e.g. when I try:
$ torchrun --nproc_per_node 2 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
I get the error:
AssertionError: Loading a checkpoint for MP=1 but world size is 2
Does this mean I can't split the load across two GPUs? Could I use deepspeed to try to accomplish this?
I also edited example.py as mentioned in another post as follows, changing:
model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()
but that didn't help; I still get the OOM error.
Thanks for any help!
WG
You can try reducing "max_batch_size" at line 44 of example.py:
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params)
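For context on why this helps: the example pre-allocates key/value caches whose size scales with max_batch_size and max_seq_len. A rough back-of-the-envelope sketch (the 7B dimensions, the fp16 assumption, and the kv_cache_bytes helper below are mine for illustration, not code from the repo):

# Rough KV-cache memory estimate; n_layers/dim are assumed 7B values, fp16 elements.
def kv_cache_bytes(max_batch_size, max_seq_len, n_layers=32, dim=4096, bytes_per_elem=2):
    # one key tensor plus one value tensor per layer, each roughly (batch, seq, dim)
    return 2 * n_layers * max_batch_size * max_seq_len * dim * bytes_per_elem

for bsz in (32, 8, 1):
    print(bsz, round(kv_cache_bytes(bsz, 1024) / 2**30, 1), "GiB")
# 32 -> 16.0 GiB, 8 -> 4.0 GiB, 1 -> 0.5 GiB, on top of the ~13 GiB of fp16 weights

So dropping max_batch_size from 32 to 8 frees roughly 12 GiB on a 24 GiB card, which is why it fits after the change.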
Thanks mperacchi! That worked. I'll paste results below.
Any advice on how to get it to use both GPUs? I'm experimenting on my local machine with two 3090s, but eventually I'll do some runs at AWS on multi-GPU machines, so I'll need to figure out how to split the load across multiple GPUs when I do that. Perhaps the 13B and 30B models support multiple GPUs? I was also wondering if I could use DeepSpeed to split the load across the two GPUs.
WG
$ torchrun --nproc_per_node 1 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 17.51 seconds
The capital of Germany is the city of Berlin. Berlin is one of the most important cities in Europe. Many people from all over the world come to visit this fascinating city.
There are many attractions in Berlin. One of the most famous is the Brandenburg Gate. It is a huge gate in the center of the city. It was built in 1791. It has ten columns. The columns are different shapes. They represent the ten states of Germany.
Another important building in Berlin is the Reichstag building. It is the home of the German Parliament. The dome of the building is the largest in Europe. It is made of glass. The dome is open to visitors.
Another famous building in Berlin is the Berlin Wall. This wall was built in 1961 to keep people in the West out of the East. It was torn down in 1989.
A very famous street in Berlin is Unter den Linden. It is a boulevard in the city center. Many famous buildings and churches are on this street.
Berlin is the capital of Germany. It is a city of people, history and culture. The city is located on the River Spree.
Berlin is divided into 12 districts.
==================================
Here is my sonnet in the style of Shakespeare about an artificial intelligence:
What once was a man, and now is something else
In human shape, like a robot, but without soul
Drives cars, dances and dresses in a lady’s gown.
So says the media, but only partly true
To say it’s human is only a half-truth
A true robot could never act like a man
But being the first to break the mortal mould
In thinking it must be human, it is humanly flawed
A creature without a soul but with a heart.
I’m not sure if the media will get the message
That the soul’s a gift from God, not made by man
Or if they will understand that if they make a machine
That thinks it’s human it’s like a human pretending to be a machine.
Is the soul the gift of God, or the gift of man?
Is a robot a machine pretending to be human?
#artificialintelligence #poetry #robot
Author Jonathan RPosted on December 30, 2018 Categories Artificial Intelligence, poetry, Robots, Science Fiction, Shakespeare, Technology, Uncategorized
==================================
I just ran this code and got the same output. @mperacchi Cheers.
My system is 2x 3090 with 24 GB VRAM each.
I'm going to try 13B with the following command:
CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
Update 1
The above command requires (as stated in the doc) nproc_per_node to be set to 2.
The new command is: CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
I am able to see correct utilisation of the GPUs; it seems to load the 13B model OK.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 36C P2 131W / 350W | 17721MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:23:00.0 Off | N/A |
| 30% 34C P2 135W / 350W | 17721MiB / 24576MiB | 41% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
But when running inference I get this:
(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.82 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3874515) of binary: /home/user/miniconda3/envs/llama/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/llama/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
example.py FAILED
-------------------------------------------------------
Failures:
[1]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 3874516)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874516
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 3874515)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874515
=======================================================
Maybe I should put this in a new thread?
OK, I need to read the docs more... the 7B model requires MP=1, and 13B requires MP=2.
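For reference, this is how I keep the checkpoint-to-world-size mapping straight. The shard counts below are what the released checkpoints ship with as far as I know, and MP_SIZE / nproc_for are just my own cheat-sheet names, not anything from the repo:

# Model-parallel shard count per released checkpoint (as I understand it);
# --nproc_per_node (and the number of visible GPUs) must match this number.
MP_SIZE = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}

def nproc_for(model: str) -> int:
    return MP_SIZE[model]

print(nproc_for("13B"))  # 2 -> torchrun --nproc_per_node 2 example.py --ckpt_dir .../13B ...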
Since you have two 3090s, I'm wondering: can you try the 13B model with MP=2 on your machine? Do the results look better than 7B?
With the 13B model and MP=2, I still get an OOM error when I try the original batch size of 32 in example.py; however, a batch size of 16 or 24 does work.
@wupgop would you mind posting some example prompts and output for your 13B model? Looking in here https://github.com/facebookresearch/llama/issues/75 it appears that the output for the 7B model is pretty wonky
I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:
https://github.com/modular-ml/wrapyfi-examples_llama
and have a readme with the instructions on how to do it:
LLaMA with Wrapyfi
Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16 GB of VRAM.
It currently distributes across two cards only, using ZeroMQ; flexible distribution will be supported soon!
This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!
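To give a feel for the general idea before the steps below: one process runs the first half of the layers and ships its hidden states over ZeroMQ to a second process, which runs the remaining layers. This is a plain pyzmq sketch for illustration only, not Wrapyfi's actual API; the socket addresses, tensor names, and the send_activation/recv_activation helpers are placeholders I made up.

# Illustration only: shipping an intermediate activation between two processes over ZeroMQ.
import pickle
import torch
import zmq

def send_activation(sock, tensor):
    # serialize on CPU so the bytes on the wire are device-agnostic
    sock.send(pickle.dumps(tensor.detach().cpu()))

def recv_activation(sock, device="cuda:0"):
    return pickle.loads(sock.recv()).to(device)

# Process A (first half of the layers):
#   ctx = zmq.Context(); pub = ctx.socket(zmq.PUB); pub.bind("tcp://*:5555")
#   send_activation(pub, hidden_states)
# Process B (second half of the layers):
#   ctx = zmq.Context(); sub = ctx.socket(zmq.SUB)
#   sub.connect("tcp://10.0.0.101:5555"); sub.setsockopt(zmq.SUBSCRIBE, b"")
#   hidden_states = recv_activation(sub)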
How to?
1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts.
2. Download the LLaMA weights using the official form and install this wrapyfi-examples_llama inside a conda or virtual env:
git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
3. Install Wrapyfi with the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
6. Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
7. You will now see the output on both terminals.
8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker with that IP set and provide the env variables for steps 5 and 6, e.g.:
### (replace 10.0.0.101 with <YOUR_IP>) ###
# step 4 modification
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
@wupgop would you mind posting some example prompts and output for your 13B model? Looking in here #75 it appears that the output for the 7B model is pretty wonky
I posted some sample prompts and outputs for all 4 models in the issue you linked to.
You can try reducing "max_batch_size" at line 44 of example.py:
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params)
This worked and reduced VRAM usage on one of my GPUs with the 13B model, but the other GPU's usage didn't change... Any ideas? I'll post if I figure something out. The 7B model ran fine on my single 3090. My setup is: GPU 0: NVIDIA GeForce RTX 3090, GPU 1: NVIDIA GeForce GTX 960, GPU 2: NVIDIA GeForce RTX 3060.
Before changing max_batch_size
gpu    fb  bar1   sm  mem  enc  dec  pwr  gtemp  mtemp
Idx    MB    MB    %    %    %    %    W      C      C
  0 20456     5   41    5    0    0  108     35      -
  1   585    41   81    9    0    0   36     64      -
  2 11974     5   32   34    0    0   44     55      -
after changing max_batch_size
gpu    fb  bar1   sm  mem  enc  dec  pwr  gtemp  mtemp
Idx    MB    MB    %    %    %    %    W      C      C
  0 15680     5   47    1    0    0  127     35      -
  1   587    41   79    9    0    0   36     65      -
  2 11938     5   42   39    0    0   47     56      -
[quoting the 2x 3090 post above, where the 13B model with nproc_per_node 2 loads but inference fails with Signal 7 (SIGBUS)]
I have the same problem; have you solved it?
Some worker thread is failing, but those error messages don't give much info. Sorry for the nonspecific suggestion, but I'm using it with CUDA 11.7 and Python 3.10 and it works fine. How about making a new conda environment with Python 3.10 and CUDA 11.7 to see if it works? With nonspecific errors like this I like to just clean everything out: uninstall everything from the base environment and reinstall CUDA, torch, Python, etc. within a fresh conda environment. That fixes a lot of voodoo. Once it's set up, a quick sanity check like the sketch below can confirm what the new environment actually sees.
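A minimal sketch using only standard PyTorch introspection calls (nothing here is llama-specific):

# Quick environment sanity check for the fresh conda env.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)        # should line up with the 11.7 toolkit
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 2**30, 1), "GiB")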
My system is 2x 3090 with 24 GB VRAM each and 32 GB of RAM. I'm using CUDA 11.7 and Python 3.10.
I'm hoping to run the 7B model on a 3070; can anyone tell me if there's any hope for this?
Closing, as the specifications were clarified and the max_batch_size parameter was noted to be very useful in reducing maximum GPU memory usage.