torch.cuda.OutOfMemoryError for multi-GPU training
Strangely, I cannot train bloomz-7b1 on multiple GPUs (8× Tesla V100 with 32 GB each), while I can train the same model on a single GPU (NVIDIA A10G, 24 GB). I don't have any other processes running.
The training code can be found here, with a training script similar to this one.
In the multi-GPU case I get the following log:
All model checkpoint weights were used when initializing BloomForCausalLM.
All the weights of BloomForCausalLM were initialized from the model checkpoint at models/bloomz-7b1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BloomForCausalLM for predictions without further training.
Generation config file not found, using a generation config created from the model config.
PyTorch: setting up devices
Using cuda_amp half precision backend
***** Running training *****
Num examples = 244
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 384
Gradient Accumulation steps = 12
Total optimization steps = 1
Number of trainable parameters = 3,932,160
0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/ec2-user/SageMaker/efs/rulm/self_instruct/scripts/train.py", line 177, in <module>
train(**vars(args))
File "/home/ec2-user/SageMaker/efs/rulm/self_instruct/scripts/train.py", line 160, in train
trainer.train(checkpoint)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
outputs = model(**inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
torch.cuda.OutOfMemoryError: Caught OutOfMemoryError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/peft/peft_model.py", line 657, in forward
return self.base_model(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
outputs = block(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 439, in forward
attn_outputs = self.self_attention(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 331, in forward
attention_scores = attention_scores.to(torch.float)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 31.75 GiB total capacity; 30.29 GiB already allocated; 53.94 MiB free; 30.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
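For what it's worth, the per-device batch size of 4 with 12 gradient accumulation steps across 8 GPUs is what gives the total train batch size of 384 in the log above. The allocator message also suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF; below is a minimal sketch of how that could be set before torch initializes CUDA (the value 128 is an arbitrary guess, not something I have verified to help):
```python
import os

# Hypothetical mitigation for fragmentation, as hinted by the allocator message above.
# Must be set before the first CUDA allocation; 128 MiB is an arbitrary example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var on purpose

print(torch.cuda.is_available())
```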
Meanwhile, training works normally in the single-GPU scenario.
I also set load_in_8bit=False for multi-GPU training, as per this comment.
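For reference, the model loading in my script looks roughly like the sketch below; the LoRA hyperparameters here are placeholders, the real values live in configs/bloomz_7b1.json:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "models/bloomz-7b1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,          # disabled for multi-GPU training, as per the comment above
    torch_dtype=torch.float16,
)

# Placeholder LoRA config; the actual values come from configs/bloomz_7b1.json.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the log above reports 3,932,160 trainable parameters
```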
Here is the command I use to run the code:
python3 scripts/train.py --config-file configs/bloomz_7b1.json --train-file train.jsonl --val-file val.jsonl --output-dir models/bloomz_7b1_lora
where train.jsonl/val.jsonl can be generated with this script.
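One more detail that may matter: I launch with plain python3 (no torchrun or accelerate), so with all 8 GPUs visible the Trainer falls back to torch.nn.DataParallel, which matches the data_parallel.py frames in the traceback. A quick sketch of what the Trainer sees on each machine (all other training arguments omitted):
```python
import torch
from transformers import TrainingArguments

# Minimal check of how many devices the Trainer will use on this machine.
args = TrainingArguments(output_dir="models/bloomz_7b1_lora")

print(torch.cuda.device_count())  # 8 on the V100 machine, 1 on the A10G machine
print(args.n_gpu)                 # n_gpu > 1 without a distributed launch takes the DataParallel path
```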