
CUDA out of memory during training

Open pariskang opened this issue 2 years ago • 3 comments

Description: I encountered a CUDA out-of-memory error while training the model on a single 3090. I ran the following command in the terminal:

CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

The input arguments and the error message are shown below:

====Input Arguments====
{
  "data_root": "data",
  "output_dir": "experiments",
  "model": "allenai/unifiedqa-t5-base",
  "options": [ "A", "B", "C", "D", "E" ],
  "epoch": 20,
  "lr": 5e-05,
  "bs": 8,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": 10,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": false,
  "final_eval": true,
  "user_msg": "rationale",
  "img_type": "detr",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": null,
  "caption_file": "data/captions.json",
  "use_caption": false,
  "prompt_format": "QCM-LE",
  "seed": 42
}
img_features size: (11208, 100, 256)
number of train problems: 12726
number of val problems: 4241
number of test problems: 4241

...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 1; 23.70 GiB total capacity; 896.00 KiB already allocated; 2.69 MiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It seems that the program ran out of memory while allocating 96.00 MiB on GPU 1. The GPU has a total capacity of 23.70 GiB, and only 2.69 MiB of free memory was available at the time. The error message suggests trying to set max_split_size_mb to avoid fragmentation.
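The numbers in the error message can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `parse_oom` and the variable `held_elsewhere` are made up here, and the message text is copied from the traceback above. It converts every size to MiB and computes how much of GPU 1 was neither free nor reserved by this process's PyTorch allocator.

```python
import re

# Error text copied from the traceback above.
message = (
    "CUDA out of memory. Tried to allocate 96.00 MiB "
    "(GPU 1; 23.70 GiB total capacity; 896.00 KiB already allocated; "
    "2.69 MiB free; 2.00 MiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
SIZE = r"([\d.]+) (KiB|MiB|GiB)"


def parse_oom(text):
    """Hypothetical helper: return the sizes named in the message, in MiB."""
    patterns = {
        "requested": rf"Tried to allocate {SIZE}",
        "total": rf"{SIZE} total capacity",
        "allocated": rf"{SIZE} already allocated",
        "free": rf"{SIZE} free",
        "reserved": rf"{SIZE} reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        value, unit = match.groups()
        sizes[name] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom(message)

# Memory on the device that is neither free nor held by this process's
# PyTorch caching allocator (other processes, CUDA context overhead, etc.).
held_elsewhere = sizes["total"] - sizes["free"] - sizes["reserved"]

print(f"allocated by PyTorch here: {sizes['allocated']:.3f} MiB")
print(f"reserved by PyTorch here:  {sizes['reserved']:.3f} MiB")
print(f"held outside the allocator: {held_elsewhere / 1024:.2f} GiB")
```

Under this reading, the process had reserved only about 2 MiB when the allocation failed, so reserved memory is not much larger than allocated memory and fragmentation is an unlikely cause. Nearly all of GPU 1 appears to have been occupied by something outside this process's allocator, which may be worth checking with `nvidia-smi` before changing `max_split_size_mb`.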

Is there any way to run this on a single 3090? I would also like to know how many GPUs are needed to train this model. Thank you.
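One thing that could be tried for the single-GPU case is exposing only one device and lowering the batch sizes. The sketch below is an untested assumption, not a confirmed fix: it reuses only the flags from the command above, the function name `single_gpu_command` is hypothetical, and the values `bs=2` and `eval_bs=1` are illustrative guesses rather than settings known to fit in 24 GiB.

```python
import os
import shlex


def single_gpu_command(bs=2, eval_bs=1, device="0"):
    """Hypothetical helper: build the env and argv for a one-GPU run.

    The batch sizes are illustrative assumptions; all flags are taken
    from the command in the original post.
    """
    # Expose a single device so nothing is placed on GPU 1.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=device)
    args = [
        "python", "main.py",
        "--model", "allenai/unifiedqa-t5-base",
        "--user_msg", "rationale",
        "--img_type", "detr",
        "--bs", str(bs),
        "--eval_bs", str(eval_bs),
        "--eval_acc", "10",
        "--output_len", "512",
        "--final_eval",
        "--prompt_format", "QCM-LE",
    ]
    return env, args


env, args = single_gpu_command()

# Print the equivalent shell command instead of launching it.
print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(args)}")

# To actually launch the run from Python:
#   import subprocess
#   subprocess.run(args, env=env, check=True)
```

Note that a smaller `--bs` changes the effective batch size, so training behavior may differ from the original configuration.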

pariskang, Feb 25 '23 08:02

similar issue

csuestc, Feb 25 '23 10:02

@pariskang try the fix described here: https://github.com/amazon-science/mm-cot/issues/28#issue-1598904259

csuestc, Feb 26 '23 02:02

Thank you. I will try it later.

pariskang, Feb 26 '23 08:02