Co-DETR

Why is my GPU memory usage so large? Is this normal?

Open tyloocifer opened this issue 1 year ago • 13 comments

When I use a 3090 with images resized to 1920*1080 and batchsize=1, GPU memory is exhausted immediately. How can I solve this, and which step causes such a large overhead?

tyloocifer avatar Nov 29 '23 07:11 tyloocifer

I used co_dino_5scale_lsj_r50_1x_coco.py in the MMDetection project.

tyloocifer avatar Nov 29 '23 07:11 tyloocifer

When I change the model to DINO it works, but Co-DINO doesn't. I tried reducing num_co_head and using only Faster R-CNN or ATSS, but it still requires a lot of memory.

tyloocifer avatar Nov 29 '23 07:11 tyloocifer

LSJ augmentation requires more memory than DETR augmentation. If you adopt a resolution of 1920x1080, it's better to use the config co_dino_5scale_r50_1x_coco.py. Besides, you can enable checkpointing by adding with_cp=True to the backbone config and changing 'with_cp' in the encoder config from 4 to 6:

backbone=dict(
    type='ResNet',
    depth=50,
    num_stages=4,
    out_indices=(0, 1, 2, 3),
    frozen_stages=1,
    norm_cfg=dict(type='BN', requires_grad=False),
    norm_eval=True,
    style='pytorch',
    with_cp=True,  # enable gradient checkpointing in the backbone
    init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
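
For the encoder half of that suggestion, a minimal sketch of the override might look as follows. Only `with_cp=True` on the backbone and the change of the encoder's `with_cp` from 4 to 6 come from the comment above; the `query_head` / `transformer` / `encoder` nesting and `num_layers=6` are assumptions about the config layout, so check them against your own config file.

```python
# Sketch of the two checkpointing overrides, written as a plain Python dict.
# ASSUMPTION: the encoder sits under model.query_head.transformer.encoder and
# has 6 layers; verify both against the config you actually use.
model = dict(
    backbone=dict(
        with_cp=True,  # checkpoint the backbone stages
    ),
    query_head=dict(
        transformer=dict(
            encoder=dict(
                num_layers=6,  # assumed layer count
                with_cp=6,     # was 4; raised to 6 as suggested above
            ),
        ),
    ),
)

# Quick sanity check that the override landed where intended.
encoder_cfg = model['query_head']['transformer']['encoder']
print(encoder_cfg['with_cp'])
```

Checkpointing trades extra compute for lower activation memory, so expect training to slow down as more layers are checkpointed.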

TempleX98 avatar Nov 29 '23 08:11 TempleX98

It still doesn't work. I only adopted your network; I didn't use the dataloader or the rest of the config. With DINO it allocates just 10 GB at batchsize=1, but co_dino_r50_1x can't run: it reports CUDA out of memory.

tyloocifer avatar Nov 29 '23 10:11 tyloocifer

Do you use DINO-4scale?

TempleX98 avatar Nov 29 '23 10:11 TempleX98

yep

tyloocifer avatar Nov 29 '23 11:11 tyloocifer

Perhaps I need to change it to 4-scale?

tyloocifer avatar Nov 29 '23 11:11 tyloocifer

Yes, the 5-scale model consumes much more memory than the 4-scale one.
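
A rough way to see the gap is to count the feature-map positions the transformer encoder has to process at 1920x1080. The sketch below assumes the 4-scale model uses strides 8, 16, 32, 64 and the 5-scale model adds a stride-4 level; these strides and the helper name `encoder_tokens` are assumptions for illustration, and padding in the real pipeline may shift the numbers slightly.

```python
import math

def encoder_tokens(height, width, strides):
    """Total number of feature-map positions across all levels."""
    return sum(math.ceil(height / s) * math.ceil(width / s) for s in strides)

# ASSUMED strides; check the backbone out_indices and neck of your config.
FOUR_SCALE = (8, 16, 32, 64)
FIVE_SCALE = (4, 8, 16, 32, 64)

four = encoder_tokens(1080, 1920, FOUR_SCALE)
five = encoder_tokens(1080, 1920, FIVE_SCALE)
print(four, five, round(five / four, 2))
```

Under these assumptions the extra stride-4 level alone holds more positions than all four coarser levels combined, which is consistent with the 5-scale model needing far more memory.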

TempleX98 avatar Nov 29 '23 11:11 TempleX98

I use projects/configs/co_dino/co_dino_5scale_swin_large_16e_o365tococo.py, and it seems that if I freeze the backbone and set checkpointing to False, it goes OOM on a 24 GB A30.

Feobi1999 avatar Nov 30 '23 03:11 Feobi1999

Co-DETR with a frozen SwinL backbone and an image size of 1333x800 requires more than 15 GB of memory. The config you use enlarges the resolution by 1.5x, so 24 GB may be insufficient. AMP and FSDP can help reduce training memory.
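
As a back-of-the-envelope check, scaling each side by 1.5x multiplies the pixel count by 2.25. The sketch below applies that factor to the 15 GB figure, assuming (crudely) that the whole footprint scales linearly with pixel count. That overstates the result, since weights and optimizer state do not grow with resolution, so treat it as a loose upper estimate rather than a measurement. The helper name `pixel_ratio` is made up for illustration.

```python
def pixel_ratio(side_scale):
    """Factor by which the pixel count grows when each side is scaled."""
    return side_scale ** 2

BASE_MEMORY_GB = 15.0   # "more than 15GB" at 1333x800, from the comment above
GPU_MEMORY_GB = 24.0    # A30 capacity mentioned in the question

ratio = pixel_ratio(1.5)
# ASSUMPTION: all memory scales with pixel count (a deliberate overestimate).
rough_upper_gb = BASE_MEMORY_GB * ratio
print(ratio, rough_upper_gb, rough_upper_gb > GPU_MEMORY_GB)
```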

TempleX98 avatar Nov 30 '23 04:11 TempleX98

If I want a 4-scale model, what do I need to change besides the config file?

tyloocifer avatar Dec 04 '23 03:12 tyloocifer

The total loss has been oscillating around 20. Is this normal?

tyloocifer avatar Dec 04 '23 03:12 tyloocifer