Failure on A100 32GB
Hi, I've been trying to run the example inference script with the 7B model weights, but I get:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 39.59 GiB total capacity; 27.26 GiB already allocated; 24.19 MiB free; 27.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there anything I can do about this, e.g. changing the numeric type? If so, how?
Also: can I use more than one GPU?
Same for me, but in my case I have 2x RTX 2070 (8 GB each, 16 GB in total). How could we use multiple GPUs?
# | Model | MP |
# |--------|----|
# | 7B | 1 |
# | 13B | 2 |
# | 30B | 4 |
# | 65B | 8 |
export TARGET_FOLDER="models"
export model_size="7B"
export MP="1"
(llama_env) andrews@gpuserver:~/llms/llama$ torchrun --nproc_per_node $MP example.py --ckpt_dir $TARGET_FOLDER/$model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
File "example.py", line 72, in <module>
fire.Fire(main)
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 62, in main
generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
File "example.py", line 48, in load
model = Transformer(model_args)
File "/home/andrews/llms/llama/llama/model.py", line 211, in __init__
self.layers.append(TransformerBlock(layer_id, params))
File "/home/andrews/llms/llama/llama/model.py", line 184, in __init__
self.attention = Attention(args)
File "/home/andrews/llms/llama/llama/model.py", line 104, in __init__
self.wo = RowParallelLinear(
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fairscale/nn/model_parallel/layers.py", line 349, in __init__
self.weight = Parameter(torch.Tensor(self.out_features, self.input_size_per_partition))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.79 GiB total capacity; 6.48 GiB already allocated; 27.69 MiB free; 6.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2975677) of binary: /home/andrews/llms/llama/llama_env/bin/python3
Traceback (most recent call last):
File "/home/andrews/llms/llama/llama_env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_11:49:19
host : activeeon-gpuserver
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2975677)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I have tried the free-tier Google Colab, which has a Tesla T4 GPU with 15.36 GB VRAM, and the error message is like yours. Maybe we just need more VRAM (the 7B model has a ~13 GB checkpoint file).
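Rough back-of-envelope (my own estimate, not from the repo): the 7B weights are stored in fp16, so the parameters alone take roughly 13 GiB before the KV cache and activations, which leaves almost no headroom on a 15-16 GB card:
# Back-of-envelope estimate of LLaMA-7B weight memory in fp16 (illustrative only)
n_params = 7e9
bytes_per_param = 2  # fp16
weights_gib = n_params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~13 GiB, before KV cache and activations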
Same here
A 3090 with 24 GB hits the same error on 7B. There should be GPU memory requirements in the README. Please add this.
@vincenzoml Your log shows a 40 GB A100, not the 32 GB model. Can you confirm?
(GPU 0; 39.59 GiB total capacity;
You can lower the max batch size. See here: https://github.com/facebookresearch/llama/issues/42#issuecomment-1451321954
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params)
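For example, a minimal sketch of the edit described above (illustrative value; a later comment confirms max_batch_size=2 works on an A100 40GB):
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=2, **params)  # a smaller max_batch_size shrinks the pre-allocated KV cache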
@vincenzoml Your log show 40GB A100 model, not the 32GB model. Can you confirm?
(GPU 0; 39.59 GiB total capacity;
Yes I confirm, sorry for the mistake.
I confirm that setting max_batch_size=2 in model_args in example.py lets my A100 40GB run the example. Setting it to 1 causes an assertion error. I will investigate later whether the number can be raised and whether it affects runtime.
@vincenzoml if the batch size is 1, then the number of prompts per forward pass should also be 1. https://github.com/facebookresearch/llama/blob/76066b1b5cf467ce750f51af15cd34de442185e7/example.py#L63
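As a minimal sketch of that constraint (assuming the stock prompts list and generate() call in example.py; the exact arguments may differ in your copy):
prompts = ["The capital of Germany is the city of"]  # a single prompt when max_batch_size=1
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)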
I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:
https://github.com/modular-ml/wrapyfi-examples_llama
and have a readme with the instructions on how to do it:
LLaMA with Wrapyfi
Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16 GB VRAM.
It currently distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!
This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!
How to?
- Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts.
- Download the LLaMA weights using the official form and install this wrapyfi-examples_llama inside a conda or virtual env:
git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
- Install Wrapyfi with the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
- Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
- Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
- Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
- You will now see the output on both terminals.
- EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker with that IP, and provide the env variables for steps 5 and 6, e.g.:
### (replace 10.0.0.101 with <YOUR_IP>) ###
# step 4 modification
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
Check out the README at https://github.com/modular-ml/wrapyfi-examples_llama; it has instructions on how to do this.
To address the "CUDA out of memory" error, you can implement the following changes in your code:
- Reduce Batch Size: Decrease the batch size used during inference or fine-tuning (for this repo, that is max_batch_size in ModelArgs in example.py).
max_batch_size = 4  # reduce the batch size used to build ModelArgs
- Memory Management: Set the max_split_size_mb option for the PyTorch caching allocator to reduce fragmentation. Place these lines before creating the model.
import os
import torch

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before the first CUDA allocation
torch.cuda.set_per_process_memory_fraction(0.9, device=0)  # optional hard cap on this process's GPU memory
- Model Initialization: Initialize your model within a try-except block to catch any OutOfMemoryError and handle it gracefully.
import sys

def load_model(ckpt_dir, tokenizer_path, local_rank, world_size):
    try:
        model = Transformer(model_args)
    except torch.cuda.OutOfMemoryError as e:
        print(f"Error initializing the model: {e}")
        # Handle the error, e.g. by reducing max_batch_size or the model size
        return None
    return model

generator = load_model(ckpt_dir, tokenizer_path, local_rank, world_size)
if generator is None:
    sys.exit(1)  # exit the script if model initialization failed
- Gradient Accumulation: If you are fine-tuning, implement gradient accumulation to simulate larger batch sizes and reduce memory consumption (the stock example is inference-only, so this does not apply there).
accumulation_steps = 2  # accumulate gradients over 2 small batches
for step in range(total_steps):
    for _ in range(accumulation_steps):
        # Load data, run the forward pass, then scale the loss before backward
        (loss / accumulation_steps).backward()
    # Update weights after gradient accumulation
    optimizer.step()
    optimizer.zero_grad()
- Free GPU Memory: Explicitly delete tensors that are no longer needed to free up GPU memory.
del cache_k, cache_v     # after these tensors are no longer needed
torch.cuda.empty_cache()  # release unused cached memory back to the GPU
Experiment with these changes to find settings that fit within your available GPU memory while maintaining stability and performance.
@zeelsheladiya - feels like a blog post? :) Let me know if you want to author something.
cc @subramen - closing this one out but we should also consider any knowledge transfer to llama-recipes @HamidShojanazeri