torch.cuda.OutOfMemoryError for multi-GPU training
Strangely, I cannot train bloomz-7b1 on multiple GPUs (8× Tesla V100 with 32 GB each), while I can train the same model on a single GPU (NVIDIA A10G, 24 GB). I don't have any other processes running.
The training code can be found here, with a training script similar to this one.
In the multi-GPU case I get the following log:
All model checkpoint weights were used when initializing BloomForCausalLM.
All the weights of BloomForCausalLM were initialized from the model checkpoint at models/bloomz-7b1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BloomForCausalLM for predictions without further training.
Generation config file not found, using a generation config created from the model config.
PyTorch: setting up devices
Using cuda_amp half precision backend
***** Running training *****
Num examples = 244
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 384
Gradient Accumulation steps = 12
Total optimization steps = 1
Number of trainable parameters = 3,932,160
0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/ec2-user/SageMaker/efs/rulm/self_instruct/scripts/train.py", line 177, in <module>
train(**vars(args))
File "/home/ec2-user/SageMaker/efs/rulm/self_instruct/scripts/train.py", line 160, in train
trainer.train(checkpoint)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
outputs = model(**inputs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
torch.cuda.OutOfMemoryError: Caught OutOfMemoryError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/peft/peft_model.py", line 657, in forward
return self.base_model(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
outputs = block(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 439, in forward
attn_outputs = self.self_attention(
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/SageMaker/efs/envs/rulm/lib/python3.10/site-packages/transformers/models/bloom/modeling_bloom.py", line 331, in forward
attention_scores = attention_scores.to(torch.float)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 31.75 GiB total capacity; 30.29 GiB already allocated; 53.94 MiB free; 30.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
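For what it's worth, the per-device batch size of 4 with 12 gradient accumulation steps across 8 GPUs is what gives the total train batch size of 384 in the log above. The allocator message also suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF; below is a minimal sketch of how that could be set before torch initializes CUDA (the value 128 is an arbitrary guess, not something I have verified to help):
```python
import os

# Hypothetical mitigation for fragmentation, as hinted by the allocator message above.
# Must be set before the first CUDA allocation; 128 MiB is an arbitrary example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var on purpose

print(torch.cuda.is_available())
```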
Meanwhile, training works normally in the single-GPU scenario.
I also set load_in_8bit=False for multi-GPU training, as per this comment.
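For reference, the model loading in my script looks roughly like the sketch below; the LoRA hyperparameters here are placeholders, the real values live in configs/bloomz_7b1.json:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "models/bloomz-7b1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,          # disabled for multi-GPU training, as per the comment above
    torch_dtype=torch.float16,
)

# Placeholder LoRA config; the actual values come from configs/bloomz_7b1.json.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the log above reports 3,932,160 trainable parameters
```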
Here is the command I use to run the code:
python3 scripts/train.py --config-file configs/bloomz_7b1.json --train-file train.jsonl --val-file val.jsonl --output-dir models/bloomz_7b1_lora
where train.jsonl/val.jsonl can be generated with this script.
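One more detail that may matter: I launch with plain python3 (no torchrun or accelerate), so with all 8 GPUs visible the Trainer falls back to torch.nn.DataParallel, which matches the data_parallel.py frames in the traceback. A quick sketch of what the Trainer sees on each machine (all other training arguments omitted):
```python
import torch
from transformers import TrainingArguments

# Minimal check of how many devices the Trainer will use on this machine.
args = TrainingArguments(output_dir="models/bloomz_7b1_lora")

print(torch.cuda.device_count())  # 8 on the V100 machine, 1 on the A10G machine
print(args.n_gpu)                 # n_gpu > 1 without a distributed launch takes the DataParallel path
```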