
torch.cuda.OutOfMemoryError: CUDA out of memory.

Open · 0xhhhhh opened this issue 2 years ago · 1 comment

GPU Info

$ nvidia-smi
Thu Feb 23 06:54:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Command to run

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

Error message

[06:54:23] [Model]: Loading allenai/unifiedqa-t5-base...                                                                                                                                                                                                            main.py:68

           [Data]: Reading data...                                                                                                                                                                                                                                  main.py:69

Some weights of T5ForMultimodalGeneration were not initialized from the model checkpoint at allenai/unifiedqa-t5-base and are newly initialized: ['mha_layer.out_proj.weight', 'image_dense.weight', 'mha_layer.in_proj_bias', 'image_dense.bias', 'mha_layer.in_proj_weight', 'gate_dense.bias', 'mha_layer.out_proj.bias', 'gate_dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model parameters:  226643712
***** Running training *****
  Num examples = 12726
  Num Epochs = 20
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 31820
  0%|                                                                                                                                                                                                                                               | 0/31820 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/test/deploy/mm-cot/main.py", line 380, in <module>
    T5Trainer(
  File "/home/test/deploy/mm-cot/main.py", line 269, in T5Trainer
    trainer.train()
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 1498, in train
    return inner_training_loop(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 2470, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 2502, in compute_loss
    outputs = model(**inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/model.py", line 144, in forward
    decoder_outputs = self.decoder(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1035, in forward
    layer_outputs = layer_module(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 692, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 606, in forward
    attention_output = self.EncDecAttention(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 535, in forward
    attn_weights = nn.functional.dropout(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 11.17 GiB total capacity; 10.70 GiB already allocated; 20.25 MiB free; 10.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|

0xhhhhh commented on Feb 23, 2023

One possible fix for the OutOfMemoryError is to lower the "bs" parameter in the command you use to launch training:

--bs 8 --eval_bs 4

"bs" stands for "batch size"; replacing 8 with a smaller value (for example 2 or 1) reduces the activation memory needed per training step, which should help on an 11 GiB K80.
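For reference, the same command with a reduced training batch size might look like this (the exact value that fits on an 11 GiB K80 is not guaranteed; 2 is only a starting point, and eval_bs can be lowered the same way):

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 2 --eval_bs 2 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

The error message also suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF, but since reserved memory (10.80 GiB) is almost equal to allocated memory (10.70 GiB) in your traceback, fragmentation is unlikely to be the problem and shrinking the batch size is the more promising change.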

doveg commented on Feb 24, 2023