
Train model: CUDA out of memory

Open aoyang-hd opened this issue 1 year ago • 7 comments

Is there any way to train within the 24 GB of an RTX 3090, even with a batch size of one?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
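As the error message itself suggests, setting `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` can reduce allocator fragmentation when reserved memory far exceeds allocated memory. A minimal sketch; the training script name in the comment is a placeholder, not CCSR's actual entry point:

```shell
# Ask PyTorch's caching allocator to split blocks larger than 128 MiB,
# which reduces fragmentation (the "reserved >> allocated" case above).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then launch training as usual, e.g. (placeholder script name):
# python train.py --config <your-config>
echo "$PYTORCH_CUDA_ALLOC_CONF"
```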

aoyang-hd avatar Jan 22 '24 08:01 aoyang-hd

Hello, you can try fp16 training.
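For reference, a minimal sketch of fp16 training with PyTorch's automatic mixed precision; the model, optimizer, and data here are stand-ins, not CCSR's actual training loop:

```python
import torch

# Stand-in model and batch; CCSR's real model/dataloader would go here.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# GradScaler guards against fp16 gradient underflow; disabled on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 64, device=device)
for _ in range(2):
    opt.zero_grad(set_to_none=True)
    # autocast runs the forward pass in half precision where it is safe to do so.
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale loss so fp16 grads stay representable
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()
```

Halving activation precision roughly halves activation memory, which is often enough to fit a batch on a 24 GB card.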

cswry avatar Feb 08 '24 15:02 cswry

Reduce the batch size. It is hardcoded to 16, but you can reduce it.
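If shrinking the batch hurts convergence, gradient accumulation can keep the effective batch at 16 while only holding a small micro-batch in memory at a time. A hedged sketch with illustrative names, not CCSR's code:

```python
import torch

model = torch.nn.Linear(32, 1)  # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 8  # 8 micro-batches of 2 = effective batch size of 16
micro_batches = [torch.randn(2, 32) for _ in range(accum_steps)]

opt.zero_grad(set_to_none=True)
for i, batch in enumerate(micro_batches):
    # Divide by accum_steps so the summed gradient matches a full-batch average.
    loss = model(batch).pow(2).mean() / accum_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
    if (i + 1) % accum_steps == 0:
        opt.step()   # one optimizer step per effective batch
        opt.zero_grad(set_to_none=True)
```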

jfischoff avatar Feb 08 '24 16:02 jfischoff

@aoyang-hd @cswry @jfischoff I wanted to ask whether you managed to run it successfully on a single GPU. I'd appreciate it if you could reply.

zhouyizhuo avatar Mar 01 '24 02:03 zhouyizhuo

Yes, I just had to reduce the batch size.

jfischoff avatar Mar 05 '24 22:03 jfischoff

@jfischoff How long did it take you to complete the training? (●'◡'●)

zhouyizhuo avatar Mar 06 '24 00:03 zhouyizhuo

I didn't run the complete training myself, just a test. I think the full training took about 2 days on 8x A100s.

jfischoff avatar Mar 06 '24 00:03 jfischoff

Thank you for responding.😊

zhouyizhuo avatar Mar 06 '24 00:03 zhouyizhuo