In MMDetection 3.0, RAM usage keeps increasing quickly while training DETR-like object detectors, whereas in MMDetection 2.25.2 it increases slowly.
When I train DETR-like object detectors (e.g. DETR, DINO) in MMDetection 3.0, the RAM (host memory, not GPU memory) occupied by training grows quickly, and the training process is killed once no free RAM is left. When I switch to MMDetection 2.25.2, the occupied RAM grows only slowly.
In MMDetection 2.25.2, the environment info, config, and RAM usage during DETR training are as follows:
2023-05-10 20:10:52,595 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMDetection: 2.25.2+9d3e162
------------------------------------------------------------
2023-05-10 20:10:55,044 - mmdet - INFO - Distributed training: True
2023-05-10 20:10:57,363 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='AutoAugment',
policies=[[{
'type':
'Resize',
'img_scale': [(480, 1333), (512, 1333), (544, 1333), (576, 1333),
(608, 1333), (640, 1333), (672, 1333), (704, 1333),
(736, 1333), (768, 1333), (800, 1333)],
'multiscale_mode':
'value',
'keep_ratio':
True
}],
[{
'type': 'Resize',
'img_scale': [(400, 1333), (500, 1333), (600, 1333)],
'multiscale_mode': 'value',
'keep_ratio': True
}, {
'type': 'RandomCrop',
'crop_type': 'absolute_range',
'crop_size': (384, 600),
'allow_negative_crop': True
}, {
'type':
'Resize',
'img_scale': [(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
'multiscale_mode':
'value',
'override':
True,
'keep_ratio':
True
}]]),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=1),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=1),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_train2017.json',
img_prefix='data/coco/train2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='AutoAugment',
policies=[[{
'type':
'Resize',
'img_scale': [(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
'multiscale_mode':
'value',
'keep_ratio':
True
}],
[{
'type': 'Resize',
'img_scale': [(400, 1333), (500, 1333),
(600, 1333)],
'multiscale_mode': 'value',
'keep_ratio': True
}, {
'type': 'RandomCrop',
'crop_type': 'absolute_range',
'crop_size': (384, 600),
'allow_negative_crop': True
}, {
'type':
'Resize',
'img_scale': [(480, 1333), (512, 1333),
(544, 1333), (576, 1333),
(608, 1333), (640, 1333),
(672, 1333), (704, 1333),
(736, 1333), (768, 1333),
(800, 1333)],
'multiscale_mode':
'value',
'override':
True,
'keep_ratio':
True
}]]),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=1),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]),
val=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/val2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=1),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/val2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=1),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
evaluation = dict(interval=1, metric='bbox')
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='MemoryProfilerHook', interval=50)]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
model = dict(
type='DETR',
backbone=dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(3, ),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
style='pytorch',
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
bbox_head=dict(
type='DETRHead',
num_classes=80,
in_channels=2048,
transformer=dict(
type='Transformer',
encoder=dict(
type='DetrTransformerEncoder',
num_layers=6,
transformerlayers=dict(
type='BaseTransformerLayer',
attn_cfgs=[
dict(
type='MultiheadAttention',
embed_dims=256,
num_heads=8,
dropout=0.1)
],
feedforward_channels=2048,
ffn_dropout=0.1,
operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
decoder=dict(
type='DetrTransformerDecoder',
return_intermediate=True,
num_layers=6,
transformerlayers=dict(
type='DetrTransformerDecoderLayer',
attn_cfgs=dict(
type='MultiheadAttention',
embed_dims=256,
num_heads=8,
dropout=0.1),
feedforward_channels=2048,
ffn_dropout=0.1,
operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
'ffn', 'norm')))),
positional_encoding=dict(
type='SinePositionalEncoding', num_feats=128, normalize=True),
loss_cls=dict(
type='CrossEntropyLoss',
bg_cls_weight=0.1,
use_sigmoid=False,
loss_weight=1.0,
class_weight=1.0),
loss_bbox=dict(type='L1Loss', loss_weight=5.0),
loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
train_cfg=dict(
assigner=dict(
type='HungarianAssigner',
cls_cost=dict(type='ClassificationCost', weight=1.0),
reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
test_cfg=dict(max_per_img=100))
optimizer = dict(
type='AdamW',
lr=0.0001,
weight_decay=0.0001,
paramwise_cfg=dict(
custom_keys=dict(backbone=dict(lr_mult=0.1, decay_mult=1.0))))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
lr_config = dict(policy='step', step=[100])
runner = dict(type='EpochBasedRunner', max_epochs=150)
work_dir = './work_dirs/detr_r50_8x2_150e_coco'
auto_resume = False
gpu_ids = range(0, 4)
2023-05-10 20:10:57,363 - mmdet - INFO - Set random seed to 0, deterministic: False
2023-05-10 20:10:57,684 - mmdet - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2023-05-10 20:10:57,685 - mmcv - INFO - load model from: torchvision://resnet50
2023-05-10 20:10:57,685 - mmcv - INFO - load checkpoint from torchvision path: torchvision://resnet50
2023-05-10 20:10:59,315 - mmcv - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=13.04s)
creating index...
Done (t=13.22s)
creating index...
Done (t=13.16s)
creating index...
Done (t=13.19s)
creating index...
index created!
index created!
index created!
index created!
2023-05-10 20:11:17,691 - mmdet - INFO - Automatic scaling of learning rate (LR) has been disabled.
loading annotations into memory...loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.37s)
creating index...
Done (t=0.37s)
creating index...
Done (t=0.38s)
creating index...
Done (t=0.39s)
creating index...
index created!
index created!
index created!
index created!
2023-05-10 20:11:18,181 - mmdet - INFO - Start running, host: zhaorui@L1806-1, work_dir: /home/zhaorui/CV-Code/corner_case_mmdetection/work_dirs/detr_r50_8x2_150e_coco
2023-05-10 20:11:18,181 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) DistEvalHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_epoch:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(LOW ) DistEvalHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_iter:
(VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) DistEvalHook
--------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
(LOW ) DistEvalHook
(VERY_LOW ) TextLoggerHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(LOW ) DistEvalHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_epoch:
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
--------------------
2023-05-10 20:11:18,181 - mmdet - INFO - workflow: [('train', 1)], max: 150 epochs
2023-05-10 20:11:18,181 - mmdet - INFO - Checkpoints will be saved to /home/zhaorui/CV-Code/corner_case_mmdetection/work_dirs/detr_r50_8x2_150e_coco by HardDiskBackend.
2023-05-10 20:11:24,608 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2023-05-10 20:11:35,244 - mmdet - INFO - Memory information available_memory: 182824 MB, used_memory: 72203 MB, memory_utilization: 29.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7163 MB
2023-05-10 20:11:35,254 - mmdet - INFO - Epoch [1][50/14659] lr: 1.000e-04, eta: 8 days, 16:24:28, time: 0.341, data_time: 0.067, memory: 4392, loss_cls: 2.1938, loss_bbox: 3.9803, loss_iou: 2.5871, d0.loss_cls: 2.2074, d0.loss_bbox: 3.9588, d0.loss_iou: 2.5466, d1.loss_cls: 2.1931, d1.loss_bbox: 3.9829, d1.loss_iou: 2.5745, d2.loss_cls: 2.1761, d2.loss_bbox: 3.9786, d2.loss_iou: 2.5937, d3.loss_cls: 2.1937, d3.loss_bbox: 3.9686, d3.loss_iou: 2.6020, d4.loss_cls: 2.1737, d4.loss_bbox: 3.9452, d4.loss_iou: 2.6070, loss: 52.4630, grad_norm: 102.9133
2023-05-10 20:11:46,068 - mmdet - INFO - Memory information available_memory: 182818 MB, used_memory: 72243 MB, memory_utilization: 29.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:11:46,072 - mmdet - INFO - Epoch [1][100/14659] lr: 1.000e-04, eta: 7 days, 2:18:29, time: 0.216, data_time: 0.006, memory: 4392, loss_cls: 1.9114, loss_bbox: 3.1279, loss_iou: 2.3006, d0.loss_cls: 1.9293, d0.loss_bbox: 3.0290, d0.loss_iou: 2.1662, d1.loss_cls: 1.9299, d1.loss_bbox: 3.0307, d1.loss_iou: 2.1783, d2.loss_cls: 1.9355, d2.loss_bbox: 3.0487, d2.loss_iou: 2.2081, d3.loss_cls: 1.9106, d3.loss_bbox: 3.1001, d3.loss_iou: 2.2756, d4.loss_cls: 1.9057, d4.loss_bbox: 3.1560, d4.loss_iou: 2.3314, loss: 43.4750, grad_norm: 152.1985
2023-05-10 20:11:56,751 - mmdet - INFO - Memory information available_memory: 182715 MB, used_memory: 72352 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:11:56,762 - mmdet - INFO - Epoch [1][150/14659] lr: 1.000e-04, eta: 6 days, 13:01:41, time: 0.214, data_time: 0.006, memory: 4830, loss_cls: 2.0447, loss_bbox: 2.4544, loss_iou: 2.1047, d0.loss_cls: 2.0233, d0.loss_bbox: 2.4988, d0.loss_iou: 2.0763, d1.loss_cls: 2.0397, d1.loss_bbox: 2.4348, d1.loss_iou: 2.0512, d2.loss_cls: 2.0682, d2.loss_bbox: 2.4394, d2.loss_iou: 2.0628, d3.loss_cls: 2.0695, d3.loss_bbox: 2.4528, d3.loss_iou: 2.0829, d4.loss_cls: 2.0511, d4.loss_bbox: 2.4545, d4.loss_iou: 2.0862, loss: 39.4952, grad_norm: 259.5170
2023-05-10 20:12:07,251 - mmdet - INFO - Memory information available_memory: 182677 MB, used_memory: 72390 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:12:07,263 - mmdet - INFO - Epoch [1][200/14659] lr: 1.000e-04, eta: 6 days, 5:49:50, time: 0.210, data_time: 0.006, memory: 4830, loss_cls: 1.9450, loss_bbox: 2.2313, loss_iou: 1.8283, d0.loss_cls: 1.9235, d0.loss_bbox: 2.2384, d0.loss_iou: 1.8772, d1.loss_cls: 1.9414, d1.loss_bbox: 2.1388, d1.loss_iou: 1.8145, d2.loss_cls: 1.9658, d2.loss_bbox: 2.1521, d2.loss_iou: 1.7916, d3.loss_cls: 1.9551, d3.loss_bbox: 2.1625, d3.loss_iou: 1.7988, d4.loss_cls: 1.9459, d4.loss_bbox: 2.2229, d4.loss_iou: 1.8349, loss: 35.7679, grad_norm: 323.9576
2023-05-10 20:12:18,064 - mmdet - INFO - Memory information available_memory: 182642 MB, used_memory: 72402 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:12:18,075 - mmdet - INFO - Epoch [1][250/14659] lr: 1.000e-04, eta: 6 days, 2:16:32, time: 0.216, data_time: 0.006, memory: 4830, loss_cls: 2.0603, loss_bbox: 1.8460, loss_iou: 1.7571, d0.loss_cls: 2.0576, d0.loss_bbox: 1.7854, d0.loss_iou: 1.8114, d1.loss_cls: 2.0570, d1.loss_bbox: 1.7105, d1.loss_iou: 1.7802, d2.loss_cls: 2.0805, d2.loss_bbox: 1.7049, d2.loss_iou: 1.7625, d3.loss_cls: 2.0608, d3.loss_bbox: 1.7246, d3.loss_iou: 1.7406, d4.loss_cls: 2.0551, d4.loss_bbox: 1.7979, d4.loss_iou: 1.7672, loss: 33.5596, grad_norm: 339.6177
2023-05-10 20:12:28,839 - mmdet - INFO - Memory information available_memory: 182488 MB, used_memory: 72579 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7161 MB
2023-05-10 20:12:28,850 - mmdet - INFO - Epoch [1][300/14659] lr: 1.000e-04, eta: 5 days, 23:49:38, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.9402, loss_bbox: 1.5102, loss_iou: 1.7017, d0.loss_cls: 1.9407, d0.loss_bbox: 1.6412, d0.loss_iou: 1.7632, d1.loss_cls: 1.9474, d1.loss_bbox: 1.5363, d1.loss_iou: 1.7276, d2.loss_cls: 1.9236, d2.loss_bbox: 1.5126, d2.loss_iou: 1.7041, d3.loss_cls: 1.9230, d3.loss_bbox: 1.4893, d3.loss_iou: 1.6908, d4.loss_cls: 1.9429, d4.loss_bbox: 1.5004, d4.loss_iou: 1.6998, loss: 31.0952, grad_norm: 336.9932
2023-05-10 20:12:39,560 - mmdet - INFO - Memory information available_memory: 182534 MB, used_memory: 72560 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:12:39,571 - mmdet - INFO - Epoch [1][350/14659] lr: 1.000e-04, eta: 5 days, 21:58:59, time: 0.214, data_time: 0.006, memory: 4830, loss_cls: 1.9630, loss_bbox: 1.5236, loss_iou: 1.7866, d0.loss_cls: 1.9710, d0.loss_bbox: 1.5336, d0.loss_iou: 1.7642, d1.loss_cls: 1.9666, d1.loss_bbox: 1.4777, d1.loss_iou: 1.7624, d2.loss_cls: 1.9678, d2.loss_bbox: 1.5017, d2.loss_iou: 1.7573, d3.loss_cls: 1.9619, d3.loss_bbox: 1.5048, d3.loss_iou: 1.7810, d4.loss_cls: 1.9643, d4.loss_bbox: 1.4852, d4.loss_iou: 1.7721, loss: 31.4447, grad_norm: 262.3158
2023-05-10 20:12:50,185 - mmdet - INFO - Memory information available_memory: 182338 MB, used_memory: 72678 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:12:50,196 - mmdet - INFO - Epoch [1][400/14659] lr: 1.000e-04, eta: 5 days, 20:27:14, time: 0.213, data_time: 0.006, memory: 4830, loss_cls: 1.8691, loss_bbox: 1.4978, loss_iou: 1.7108, d0.loss_cls: 1.8743, d0.loss_bbox: 1.5374, d0.loss_iou: 1.7244, d1.loss_cls: 1.8739, d1.loss_bbox: 1.4835, d1.loss_iou: 1.6921, d2.loss_cls: 1.8913, d2.loss_bbox: 1.4706, d2.loss_iou: 1.6746, d3.loss_cls: 1.8762, d3.loss_bbox: 1.4604, d3.loss_iou: 1.6962, d4.loss_cls: 1.8708, d4.loss_bbox: 1.4814, d4.loss_iou: 1.6969, loss: 30.3819, grad_norm: 243.9912
2023-05-10 20:13:00,927 - mmdet - INFO - Memory information available_memory: 182314 MB, used_memory: 72738 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7160 MB
2023-05-10 20:13:00,938 - mmdet - INFO - Epoch [1][450/14659] lr: 1.000e-04, eta: 5 days, 19:25:18, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.8927, loss_bbox: 1.3934, loss_iou: 1.7035, d0.loss_cls: 1.9322, d0.loss_bbox: 1.4574, d0.loss_iou: 1.6973, d1.loss_cls: 1.9193, d1.loss_bbox: 1.3880, d1.loss_iou: 1.6914, d2.loss_cls: 1.9064, d2.loss_bbox: 1.3733, d2.loss_iou: 1.6775, d3.loss_cls: 1.9074, d3.loss_bbox: 1.3528, d3.loss_iou: 1.6721, d4.loss_cls: 1.9046, d4.loss_bbox: 1.3834, d4.loss_iou: 1.6851, loss: 29.9380, grad_norm: 233.0428
2023-05-10 20:13:11,514 - mmdet - INFO - Memory information available_memory: 182325 MB, used_memory: 72722 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7166 MB
2023-05-10 20:13:11,531 - mmdet - INFO - Epoch [1][500/14659] lr: 1.000e-04, eta: 5 days, 18:24:22, time: 0.212, data_time: 0.006, memory: 4830, loss_cls: 1.8838, loss_bbox: 1.3841, loss_iou: 1.7395, d0.loss_cls: 1.9165, d0.loss_bbox: 1.4486, d0.loss_iou: 1.7354, d1.loss_cls: 1.8909, d1.loss_bbox: 1.3882, d1.loss_iou: 1.7304, d2.loss_cls: 1.8895, d2.loss_bbox: 1.3480, d2.loss_iou: 1.7022, d3.loss_cls: 1.8834, d3.loss_bbox: 1.3319, d3.loss_iou: 1.6993, d4.loss_cls: 1.8815, d4.loss_bbox: 1.3598, d4.loss_iou: 1.7244, loss: 29.9373, grad_norm: 220.6434
2023-05-10 20:13:22,229 - mmdet - INFO - Memory information available_memory: 182335 MB, used_memory: 72731 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7168 MB
2023-05-10 20:13:22,239 - mmdet - INFO - Epoch [1][550/14659] lr: 1.000e-04, eta: 5 days, 17:43:01, time: 0.214, data_time: 0.007, memory: 4830, loss_cls: 1.9474, loss_bbox: 1.4369, loss_iou: 1.7117, d0.loss_cls: 1.9816, d0.loss_bbox: 1.5054, d0.loss_iou: 1.7683, d1.loss_cls: 1.9604, d1.loss_bbox: 1.4625, d1.loss_iou: 1.7264, d2.loss_cls: 1.9607, d2.loss_bbox: 1.4035, d2.loss_iou: 1.6959, d3.loss_cls: 1.9442, d3.loss_bbox: 1.4316, d3.loss_iou: 1.6960, d4.loss_cls: 1.9440, d4.loss_bbox: 1.4106, d4.loss_iou: 1.6962, loss: 30.6832, grad_norm: 191.9449
2023-05-10 20:13:32,960 - mmdet - INFO - Memory information available_memory: 182194 MB, used_memory: 72868 MB, memory_utilization: 29.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7169 MB
2023-05-10 20:13:32,971 - mmdet - INFO - Epoch [1][600/14659] lr: 1.000e-04, eta: 5 days, 17:09:29, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.8308, loss_bbox: 1.3737, loss_iou: 1.6855, d0.loss_cls: 1.8474, d0.loss_bbox: 1.3880, d0.loss_iou: 1.6779, d1.loss_cls: 1.8536, d1.loss_bbox: 1.3698, d1.loss_iou: 1.6720, d2.loss_cls: 1.8556, d2.loss_bbox: 1.3603, d2.loss_iou: 1.6834, d3.loss_cls: 1.8424, d3.loss_bbox: 1.3493, d3.loss_iou: 1.6917, d4.loss_cls: 1.8346, d4.loss_bbox: 1.3686, d4.loss_iou: 1.7034, loss: 29.3879, grad_norm: 189.7734
2023-05-10 20:13:43,612 - mmdet - INFO - Memory information available_memory: 182169 MB, used_memory: 72878 MB, memory_utilization: 29.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7160 MB
2023-05-10 20:13:43,627 - mmdet - INFO - Epoch [1][650/14659] lr: 1.000e-04, eta: 5 days, 16:36:41, time: 0.213, data_time: 0.006, memory: 4830, loss_cls: 1.8788, loss_bbox: 1.4147, loss_iou: 1.7250, d0.loss_cls: 1.8995, d0.loss_bbox: 1.4720, d0.loss_iou: 1.7449, d1.loss_cls: 1.8870, d1.loss_bbox: 1.4685, d1.loss_iou: 1.7655, d2.loss_cls: 1.9009, d2.loss_bbox: 1.4162, d2.loss_iou: 1.7115, d3.loss_cls: 1.8853, d3.loss_bbox: 1.3990, d3.loss_iou: 1.7044, d4.loss_cls: 1.8818, d4.loss_bbox: 1.3866, d4.loss_iou: 1.7156, loss: 30.2570, grad_norm: 180.2028
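To put a number on "increases slowly", the `Memory information` lines in the log above can be parsed and turned into a growth rate. The following is only a minimal sketch: the regular expression assumes the exact line layout printed in the 2.25.2 log above (timestamp, `used_memory`, `current_process_memory`), and the helper names `parse_memory_lines` and `growth_mb_per_min` are made up for illustration; they are not part of MMDetection or MMCV.

```python
import re
from datetime import datetime

# Assumed layout, copied from the "Memory information" lines in the log above.
MEM_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*?"
    r"used_memory: (?P<used>\d+) MB.*?"
    r"current_process_memory: (?P<proc>\d+) MB"
)

# First and last "Memory information" lines from the log above.
SAMPLE_LOG = [
    "2023-05-10 20:11:35,244 - mmdet - INFO - Memory information "
    "available_memory: 182824 MB, used_memory: 72203 MB, "
    "memory_utilization: 29.0 %, available_swap_memory: 74 MB, "
    "used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, "
    "current_process_memory: 7163 MB",
    "2023-05-10 20:13:43,612 - mmdet - INFO - Memory information "
    "available_memory: 182169 MB, used_memory: 72878 MB, "
    "memory_utilization: 29.3 %, available_swap_memory: 74 MB, "
    "used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, "
    "current_process_memory: 7160 MB",
]


def parse_memory_lines(lines):
    """Return (timestamp, used_memory_mb, process_memory_mb) per matching line."""
    samples = []
    for line in lines:
        match = MEM_LINE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            samples.append((ts, int(match.group("used")), int(match.group("proc"))))
    return samples


def growth_mb_per_min(samples, column):
    """Growth between first and last sample; column 1 = system, 2 = process."""
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    minutes = (last[0] - first[0]).total_seconds() / 60
    if minutes <= 0:
        return 0.0
    return (last[column] - first[column]) / minutes


samples = parse_memory_lines(SAMPLE_LOG)
print("system used_memory growth (MB/min):", growth_mb_per_min(samples, 1))
print("current_process_memory growth (MB/min):", growth_mb_per_min(samples, 2))
```

In practice the whole log file would be passed in (for example `parse_memory_lines(open(path))`, with `path` pointing at the training log). Note that `used_memory` is system-wide, so it also includes other processes on the machine, while `current_process_memory` covers only the process that wrote the log line.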
In MMDetection 3.0.0, the environment info, config, and RAM usage during DETR training are as follows:
05/10 20:31:25 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 1656719191
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
PyTorch: 2.0.0.post200
PyTorch compiling details: PyTorch built with:
- GCC 10.4
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.8
- Built with CUDA Runtime 11.2
- NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
- CuDNN 8.4.1 (built against CUDA 11.6)
- Magma 2.7.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.2, CUDNN_VERSION=8.4.1, CXX_COMPILER=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_build_env/bin/x86_64-conda-linux-gnu-c++, CXX_FLAGS=-std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/work=/usr/local/src/conda/pytorch-2.0.0 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -Wno-deprecated-declarations -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, 
LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=1, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1a0
OpenCV: 4.7.0
MMEngine: 0.7.0
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
Distributed launcher: pytorch
Distributed training: True
GPU number: 4
------------------------------------------------------------
05/10 20:31:27 - mmengine - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
backend_args = None
train_pipeline = [
dict(type='LoadImageFromFile', backend_args=None),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RandomFlip', prob=0.5),
dict(
type='RandomChoice',
transforms=[[{
'type':
'RandomChoiceResize',
'scales': [(480, 1333), (512, 1333), (544, 1333), (576, 1333),
(608, 1333), (640, 1333), (672, 1333), (704, 1333),
(736, 1333), (768, 1333), (800, 1333)],
'keep_ratio':
True
}],
[{
'type': 'RandomChoiceResize',
'scales': [(400, 1333), (500, 1333), (600, 1333)],
'keep_ratio': True
}, {
'type': 'RandomCrop',
'crop_type': 'absolute_range',
'crop_size': (384, 600),
'allow_negative_crop': True
}, {
'type':
'RandomChoiceResize',
'scales': [(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
'keep_ratio':
True
}]]),
dict(type='PackDetInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile', backend_args=None),
dict(type='Resize', scale=(1333, 800), keep_ratio=True),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor'))
]
train_dataloader = dict(
batch_size=2,
num_workers=0,
persistent_workers=False,
sampler=dict(type='DefaultSampler', shuffle=True),
batch_sampler=dict(type='AspectRatioBatchSampler'),
dataset=dict(
type='CocoDataset',
data_root='data/coco/',
ann_file='annotations/instances_train2017.json',
data_prefix=dict(img='train2017/'),
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=[
dict(type='LoadImageFromFile', backend_args=None),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RandomFlip', prob=0.5),
dict(
type='RandomChoice',
transforms=[[{
'type':
'RandomChoiceResize',
'scales': [(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
'keep_ratio':
True
}],
[{
'type': 'RandomChoiceResize',
'scales': [(400, 1333), (500, 1333),
(600, 1333)],
'keep_ratio': True
}, {
'type': 'RandomCrop',
'crop_type': 'absolute_range',
'crop_size': (384, 600),
'allow_negative_crop': True
}, {
'type':
'RandomChoiceResize',
'scales':
[(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
'keep_ratio':
True
}]]),
dict(type='PackDetInputs')
],
backend_args=None))
val_dataloader = dict(
batch_size=1,
num_workers=0,
persistent_workers=False,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type='CocoDataset',
data_root='data/coco/',
ann_file='annotations/instances_val2017.json',
data_prefix=dict(img='val2017/'),
test_mode=True,
pipeline=[
dict(type='LoadImageFromFile', backend_args=None),
dict(type='Resize', scale=(1333, 800), keep_ratio=True),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor'))
],
backend_args=None))
test_dataloader = dict(
batch_size=1,
num_workers=0,
persistent_workers=False,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type='CocoDataset',
data_root='data/coco/',
ann_file='annotations/instances_val2017.json',
data_prefix=dict(img='val2017/'),
test_mode=True,
pipeline=[
dict(type='LoadImageFromFile', backend_args=None),
dict(type='Resize', scale=(1333, 800), keep_ratio=True),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor'))
],
backend_args=None))
val_evaluator = dict(
type='CocoMetric',
ann_file='data/coco/annotations/instances_val2017.json',
metric='bbox',
format_only=False,
backend_args=None)
test_evaluator = dict(
type='CocoMetric',
ann_file='data/coco/annotations/instances_val2017.json',
metric='bbox',
format_only=False,
backend_args=None)
default_scope = 'mmdet'
default_hooks = dict(
timer=dict(type='IterTimerHook'),
logger=dict(type='LoggerHook', interval=50),
param_scheduler=dict(type='ParamSchedulerHook'),
checkpoint=dict(type='CheckpointHook', interval=1),
sampler_seed=dict(type='DistSamplerSeedHook'),
visualization=dict(type='DetVisualizationHook'))
env_cfg = dict(
cudnn_benchmark=False,
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='DetLocalVisualizer',
vis_backends=[dict(type='LocalVisBackend')],
name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)
log_level = 'INFO'
load_from = None
resume = False
model = dict(
type='DETR',
num_queries=100,
data_preprocessor=dict(
type='DetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=1),
backbone=dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(3, ),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
style='pytorch',
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
neck=dict(
type='ChannelMapper',
in_channels=[2048],
kernel_size=1,
out_channels=256,
act_cfg=None,
norm_cfg=None,
num_outs=1),
encoder=dict(
num_layers=6,
layer_cfg=dict(
self_attn_cfg=dict(
embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
ffn_cfg=dict(
embed_dims=256,
feedforward_channels=2048,
num_fcs=2,
ffn_drop=0.1,
act_cfg=dict(type='ReLU', inplace=True)))),
decoder=dict(
num_layers=6,
layer_cfg=dict(
self_attn_cfg=dict(
embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
cross_attn_cfg=dict(
embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
ffn_cfg=dict(
embed_dims=256,
feedforward_channels=2048,
num_fcs=2,
ffn_drop=0.1,
act_cfg=dict(type='ReLU', inplace=True))),
return_intermediate=True),
positional_encoding=dict(num_feats=128, normalize=True),
bbox_head=dict(
type='DETRHead',
num_classes=80,
embed_dims=256,
loss_cls=dict(
type='CrossEntropyLoss',
bg_cls_weight=0.1,
use_sigmoid=False,
loss_weight=1.0,
class_weight=1.0),
loss_bbox=dict(type='L1Loss', loss_weight=5.0),
loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
train_cfg=dict(
assigner=dict(
type='HungarianAssigner',
match_costs=[
dict(type='ClassificationCost', weight=1.0),
dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
dict(type='IoUCost', iou_mode='giou', weight=2.0)
])),
test_cfg=dict(max_per_img=100))
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
clip_grad=dict(max_norm=0.1, norm_type=2),
paramwise_cfg=dict(
custom_keys=dict(backbone=dict(lr_mult=0.1, decay_mult=1.0))))
max_epochs = 150
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=150, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [
dict(
type='MultiStepLR',
begin=0,
end=150,
by_epoch=True,
milestones=[100],
gamma=0.1)
]
auto_scale_lr = dict(base_batch_size=16)
custom_hooks = [dict(type='MemoryProfilerHook', interval=50)]
launcher = 'pytorch'
work_dir = './work_dirs/detr_r50_8xb2-150e_coco'
05/10 20:31:29 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
--------------------
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook
--------------------
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
--------------------
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
--------------------
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) MemoryProfilerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
before_val_epoch:
(NORMAL ) IterTimerHook
--------------------
before_val_iter:
(NORMAL ) IterTimerHook
--------------------
after_val_iter:
(NORMAL ) IterTimerHook
(NORMAL ) DetVisualizationHook
(NORMAL ) MemoryProfilerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
before_test_epoch:
(NORMAL ) IterTimerHook
--------------------
before_test_iter:
(NORMAL ) IterTimerHook
--------------------
after_test_iter:
(NORMAL ) IterTimerHook
(NORMAL ) DetVisualizationHook
(NORMAL ) MemoryProfilerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_run:
(BELOW_NORMAL) LoggerHook
--------------------
loading annotations into memory...
loading annotations into memory...loading annotations into memory...
loading annotations into memory...
Done (t=13.93s)
creating index...
Done (t=14.04s)
creating index...
Done (t=14.10s)
creating index...
index created!
index created!
index created!
Done (t=14.40s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Done (t=0.51s)
creating index...
index created!
Done (t=0.55s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.53s)
creating index...
index created!
Done (t=0.53s)
creating index...
index created!
Done (t=0.56s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!
05/10 20:32:06 - mmengine - INFO - load model from: torchvision://resnet50
05/10 20:32:06 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
05/10 20:32:06 - mmengine - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
05/10 20:32:06 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/10 20:32:06 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/10 20:32:06 - mmengine - INFO - Checkpoints will be saved to /home/zhaorui/CV-Code/mmdetection/work_dirs/detr_r50_8xb2-150e_coco.
05/10 20:32:22 - mmengine - INFO - Memory information available_memory: 180559 MB, used_memory: 74663 MB, memory_utilization: 29.9 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7954 MB
05/10 20:32:22 - mmengine - INFO - Epoch(train) [1][ 50/14659] lr: 1.0000e-04 eta: 8 days, 3:16:02 time: 0.3197 data_time: 0.0194 memory: 3969 grad_norm: 103.5423 loss: 54.9896 loss_cls: 2.1978 loss_bbox: 4.1494 loss_iou: 2.8131 d0.loss_cls: 2.2200 d0.loss_bbox: 4.1326 d0.loss_iou: 2.8068 d1.loss_cls: 2.2382 d1.loss_bbox: 4.1817 d1.loss_iou: 2.7990 d2.loss_cls: 2.2061 d2.loss_bbox: 4.1539 d2.loss_iou: 2.8136 d3.loss_cls: 2.2067 d3.loss_bbox: 4.1143 d3.loss_iou: 2.8101 d4.loss_cls: 2.1981 d4.loss_bbox: 4.1250 d4.loss_iou: 2.8234
05/10 20:32:35 - mmengine - INFO - Memory information available_memory: 179166 MB, used_memory: 76056 MB, memory_utilization: 30.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 8327 MB
05/10 20:32:35 - mmengine - INFO - Epoch(train) [1][ 100/14659] lr: 1.0000e-04 eta: 7 days, 8:54:15 time: 0.2596 data_time: 0.0212 memory: 3839 grad_norm: 220.0322 loss: 43.2984 loss_cls: 1.9355 loss_bbox: 2.9958 loss_iou: 2.3072 d0.loss_cls: 1.9224 d0.loss_bbox: 2.9365 d0.loss_iou: 2.2635 d1.loss_cls: 1.9287 d1.loss_bbox: 2.9846 d1.loss_iou: 2.2853 d2.loss_cls: 1.9232 d2.loss_bbox: 3.0551 d2.loss_iou: 2.3758 d3.loss_cls: 1.9075 d3.loss_bbox: 2.9419 d3.loss_iou: 2.2876 d4.loss_cls: 1.9244 d4.loss_bbox: 2.9972 d4.loss_iou: 2.3263
05/10 20:32:48 - mmengine - INFO - Memory information available_memory: 177814 MB, used_memory: 77409 MB, memory_utilization: 31.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 8696 MB
05/10 20:32:48 - mmengine - INFO - Epoch(train) [1][ 150/14659] lr: 1.0000e-04 eta: 7 days, 2:31:53 time: 0.2584 data_time: 0.0214 memory: 3699 grad_norm: 504.9538 loss: 34.0111 loss_cls: 2.0356 loss_bbox: 2.0061 loss_iou: 1.6414 d0.loss_cls: 2.0125 d0.loss_bbox: 2.0345 d0.loss_iou: 1.6714 d1.loss_cls: 2.0591 d1.loss_bbox: 1.9648 d1.loss_iou: 1.6293 d2.loss_cls: 2.0408 d2.loss_bbox: 1.9693 d2.loss_iou: 1.6279 d3.loss_cls: 2.0374 d3.loss_bbox: 1.9823 d3.loss_iou: 1.6434 d4.loss_cls: 2.0304 d4.loss_bbox: 1.9886 d4.loss_iou: 1.6364
05/10 20:33:01 - mmengine - INFO - Memory information available_memory: 176527 MB, used_memory: 78695 MB, memory_utilization: 31.5 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9010 MB
05/10 20:33:01 - mmengine - INFO - Epoch(train) [1][ 200/14659] lr: 1.0000e-04 eta: 6 days, 23:42:32 time: 0.2608 data_time: 0.0210 memory: 3932 grad_norm: 452.9922 loss: 30.2156 loss_cls: 1.9281 loss_bbox: 1.6330 loss_iou: 1.5132 d0.loss_cls: 1.9302 d0.loss_bbox: 1.6076 d0.loss_iou: 1.5465 d1.loss_cls: 1.9314 d1.loss_bbox: 1.5242 d1.loss_iou: 1.5450 d2.loss_cls: 1.9316 d2.loss_bbox: 1.5356 d2.loss_iou: 1.5310 d3.loss_cls: 1.9375 d3.loss_bbox: 1.5483 d3.loss_iou: 1.5167 d4.loss_cls: 1.9255 d4.loss_bbox: 1.5909 d4.loss_iou: 1.5394
05/10 20:33:14 - mmengine - INFO - Memory information available_memory: 175384 MB, used_memory: 79839 MB, memory_utilization: 31.9 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9274 MB
05/10 20:33:14 - mmengine - INFO - Epoch(train) [1][ 250/14659] lr: 1.0000e-04 eta: 6 days, 21:07:34 time: 0.2535 data_time: 0.0206 memory: 3988 grad_norm: 400.0951 loss: 28.9441 loss_cls: 1.8622 loss_bbox: 1.4915 loss_iou: 1.4709 d0.loss_cls: 1.8810 d0.loss_bbox: 1.5435 d0.loss_iou: 1.5183 d1.loss_cls: 1.8855 d1.loss_bbox: 1.4720 d1.loss_iou: 1.4749 d2.loss_cls: 1.8688 d2.loss_bbox: 1.4429 d2.loss_iou: 1.4723 d3.loss_cls: 1.8796 d3.loss_bbox: 1.4478 d3.loss_iou: 1.4748 d4.loss_cls: 1.8615 d4.loss_bbox: 1.4564 d4.loss_iou: 1.4403
05/10 20:33:26 - mmengine - INFO - Memory information available_memory: 174363 MB, used_memory: 80859 MB, memory_utilization: 32.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9527 MB
05/10 20:33:26 - mmengine - INFO - Epoch(train) [1][ 300/14659] lr: 1.0000e-04 eta: 6 days, 18:58:01 time: 0.2492 data_time: 0.0205 memory: 3686 grad_norm: 338.3264 loss: 29.3254 loss_cls: 1.8864 loss_bbox: 1.4218 loss_iou: 1.6279 d0.loss_cls: 1.9069 d0.loss_bbox: 1.4351 d0.loss_iou: 1.5304 d1.loss_cls: 1.9050 d1.loss_bbox: 1.3700 d1.loss_iou: 1.5609 d2.loss_cls: 1.8926 d2.loss_bbox: 1.4070 d2.loss_iou: 1.6077 d3.loss_cls: 1.9016 d3.loss_bbox: 1.3791 d3.loss_iou: 1.6020 d4.loss_cls: 1.8913 d4.loss_bbox: 1.4098 d4.loss_iou: 1.5897
05/10 20:33:39 - mmengine - INFO - Memory information available_memory: 173251 MB, used_memory: 81971 MB, memory_utilization: 32.7 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9795 MB
05/10 20:33:39 - mmengine - INFO - Epoch(train) [1][ 350/14659] lr: 1.0000e-04 eta: 6 days, 17:44:55 time: 0.2529 data_time: 0.0215 memory: 3699 grad_norm: 339.1143 loss: 34.1119 loss_cls: 1.9958 loss_bbox: 1.5808 loss_iou: 1.9624 d0.loss_cls: 2.0374 d0.loss_bbox: 1.7025 d0.loss_iou: 2.0307 d1.loss_cls: 2.0513 d1.loss_bbox: 1.6896 d1.loss_iou: 2.0143 d2.loss_cls: 2.0113 d2.loss_bbox: 1.6945 d2.loss_iou: 2.0548 d3.loss_cls: 2.0207 d3.loss_bbox: 1.6315 d3.loss_iou: 2.0241 d4.loss_cls: 2.0149 d4.loss_bbox: 1.6224 d4.loss_iou: 1.9731
05/10 20:33:51 - mmengine - INFO - Memory information available_memory: 172380 MB, used_memory: 82842 MB, memory_utilization: 33.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10014 MB
05/10 20:33:51 - mmengine - INFO - Epoch(train) [1][ 400/14659] lr: 1.0000e-04 eta: 6 days, 16:20:14 time: 0.2464 data_time: 0.0211 memory: 3687 grad_norm: 275.3709 loss: 32.7396 loss_cls: 1.9504 loss_bbox: 1.6224 loss_iou: 1.8994 d0.loss_cls: 1.9714 d0.loss_bbox: 1.6036 d0.loss_iou: 1.9065 d1.loss_cls: 1.9478 d1.loss_bbox: 1.6063 d1.loss_iou: 1.8971 d2.loss_cls: 1.9491 d2.loss_bbox: 1.5973 d2.loss_iou: 1.8940 d3.loss_cls: 1.9593 d3.loss_bbox: 1.5845 d3.loss_iou: 1.9143 d4.loss_cls: 1.9638 d4.loss_bbox: 1.5691 d4.loss_iou: 1.9033
05/10 20:34:04 - mmengine - INFO - Memory information available_memory: 171473 MB, used_memory: 83749 MB, memory_utilization: 33.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10255 MB
05/10 20:34:04 - mmengine - INFO - Epoch(train) [1][ 450/14659] lr: 1.0000e-04 eta: 6 days, 15:17:40 time: 0.2472 data_time: 0.0215 memory: 3699 grad_norm: 227.7132 loss: 28.8479 loss_cls: 1.7894 loss_bbox: 1.4295 loss_iou: 1.6153 d0.loss_cls: 1.7946 d0.loss_bbox: 1.5115 d0.loss_iou: 1.6341 d1.loss_cls: 1.7866 d1.loss_bbox: 1.3678 d1.loss_iou: 1.5725 d2.loss_cls: 1.7963 d2.loss_bbox: 1.3832 d2.loss_iou: 1.5834 d3.loss_cls: 1.8030 d3.loss_bbox: 1.4149 d3.loss_iou: 1.5845 d4.loss_cls: 1.8074 d4.loss_bbox: 1.3846 d4.loss_iou: 1.5894
05/10 20:34:16 - mmengine - INFO - Memory information available_memory: 170569 MB, used_memory: 84653 MB, memory_utilization: 33.8 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10492 MB
05/10 20:34:16 - mmengine - INFO - Epoch(train) [1][ 500/14659] lr: 1.0000e-04 eta: 6 days, 14:26:44 time: 0.2470 data_time: 0.0212 memory: 3543 grad_norm: 230.8249 loss: 29.3416 loss_cls: 1.9113 loss_bbox: 1.4269 loss_iou: 1.5849 d0.loss_cls: 1.9161 d0.loss_bbox: 1.4668 d0.loss_iou: 1.5625 d1.loss_cls: 1.9111 d1.loss_bbox: 1.3832 d1.loss_iou: 1.5550 d2.loss_cls: 1.9230 d2.loss_bbox: 1.3968 d2.loss_iou: 1.5339 d3.loss_cls: 1.9316 d3.loss_bbox: 1.3564 d3.loss_iou: 1.5530 d4.loss_cls: 1.9373 d4.loss_bbox: 1.4169 d4.loss_iou: 1.5750
05/10 20:34:29 - mmengine - INFO - Memory information available_memory: 169724 MB, used_memory: 85498 MB, memory_utilization: 34.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10657 MB
05/10 20:34:29 - mmengine - INFO - Epoch(train) [1][ 550/14659] lr: 1.0000e-04 eta: 6 days, 13:59:01 time: 0.2512 data_time: 0.0205 memory: 3687 grad_norm: 200.3849 loss: 32.0705 loss_cls: 2.0306 loss_bbox: 1.5291 loss_iou: 1.8789 d0.loss_cls: 2.0774 d0.loss_bbox: 1.4849 d0.loss_iou: 1.8166 d1.loss_cls: 2.0511 d1.loss_bbox: 1.4676 d1.loss_iou: 1.8464 d2.loss_cls: 2.0474 d2.loss_bbox: 1.4644 d2.loss_iou: 1.7998 d3.loss_cls: 2.0448 d3.loss_bbox: 1.4416 d3.loss_iou: 1.7963 d4.loss_cls: 2.0288 d4.loss_bbox: 1.4541 d4.loss_iou: 1.8108
05/10 20:34:42 - mmengine - INFO - Memory information available_memory: 169022 MB, used_memory: 86200 MB, memory_utilization: 34.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10780 MB
05/10 20:34:42 - mmengine - INFO - Epoch(train) [1][ 600/14659] lr: 1.0000e-04 eta: 6 days, 14:24:42 time: 0.2672 data_time: 0.0204 memory: 3698 grad_norm: 203.3744 loss: 29.0109 loss_cls: 1.8426 loss_bbox: 1.3323 loss_iou: 1.6706 d0.loss_cls: 1.8724 d0.loss_bbox: 1.3356 d0.loss_iou: 1.6584 d1.loss_cls: 1.8487 d1.loss_bbox: 1.3364 d1.loss_iou: 1.6912 d2.loss_cls: 1.8483 d2.loss_bbox: 1.2971 d2.loss_iou: 1.6323 d3.loss_cls: 1.8533 d3.loss_bbox: 1.3045 d3.loss_iou: 1.6620 d4.loss_cls: 1.8347 d4.loss_bbox: 1.3091 d4.loss_iou: 1.6814
05/10 20:34:54 - mmengine - INFO - Memory information available_memory: 168241 MB, used_memory: 86981 MB, memory_utilization: 34.7 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10965 MB
05/10 20:34:54 - mmengine - INFO - Epoch(train) [1][ 650/14659] lr: 1.0000e-04 eta: 6 days, 13:33:11 time: 0.2412 data_time: 0.0206 memory: 3756 grad_norm: 192.4306 loss: 34.5331 loss_cls: 2.1073 loss_bbox: 1.7186 loss_iou: 2.0007 d0.loss_cls: 2.1383 d0.loss_bbox: 1.6822 d0.loss_iou: 1.9588 d1.loss_cls: 2.1329 d1.loss_bbox: 1.6375 d1.loss_iou: 1.9499 d2.loss_cls: 2.1282 d2.loss_bbox: 1.6828 d2.loss_iou: 1.9731 d3.loss_cls: 2.1248 d3.loss_bbox: 1.6090 d3.loss_iou: 1.9580 d4.loss_cls: 2.1133 d4.loss_bbox: 1.6562 d4.loss_iou: 1.9615
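To put a number on the growth in the log above, the `current_process_memory` values can be paired with the iteration counter on the line that follows each `Memory information` entry. This is only a minimal sketch that assumes the log format shown above; the function name `memory_growth_per_iter`, the regular expressions, and the abbreviated `SAMPLE` lines are illustrative and not part of mmengine.

```python
import re

# Patterns written against the log lines shown above (an assumption about the format).
MEM_RE = re.compile(r"current_process_memory: (\d+) MB")
ITER_RE = re.compile(r"Epoch\(train\)\s+\[\d+\]\[\s*(\d+)/\d+\]")


def memory_growth_per_iter(log_text):
    """Average growth of current_process_memory in MB per training iteration."""
    mems = [int(m) for m in MEM_RE.findall(log_text)]
    iters = [int(i) for i in ITER_RE.findall(log_text)]
    n = min(len(mems), len(iters))
    if n < 2:
        raise ValueError("need at least two memory samples")
    if iters[n - 1] == iters[0]:
        raise ValueError("samples must come from different iterations")
    # Compare the first and last samples only; this ignores noise in between.
    return (mems[n - 1] - mems[0]) / (iters[n - 1] - iters[0])


# First and last samples from the log above, abbreviated to the relevant fields.
SAMPLE = """\
05/10 20:32:22 - mmengine - INFO - Memory information current_process_memory: 7954 MB
05/10 20:32:22 - mmengine - INFO - Epoch(train) [1][ 50/14659] lr: 1.0000e-04
05/10 20:34:54 - mmengine - INFO - Memory information current_process_memory: 10965 MB
05/10 20:34:54 - mmengine - INFO - Epoch(train) [1][ 650/14659] lr: 1.0000e-04
"""

print(f"{memory_growth_per_iter(SAMPLE):.2f} MB per iteration")
```

For the samples above this works out to (10965 - 7954) / (650 - 50), roughly 5 MB per iteration for the process that writes the log.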
I have run into this problem as well, with PyTorch 2.0. Have you solved it? After 24 hours of training DETR on 8 × 2080Ti, memory usage was over 400 GB!
When I use PyTorch 1.13, the memory no longer overflows.
WOW!! thank you @mypydl !!
You're a life saver!! :D
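I found that the problem is RandomCrop: using RandomCrop in the data augmentation of DETR-like models causes a CPU memory leak. After I removed it, the problem went away.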
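The workaround described above (dropping RandomCrop from the training pipeline) could be applied to a config programmatically. This is a minimal sketch, assuming the pipeline is a list of dicts in which wrapper transforms such as `RandomChoice` keep nested lists under a `transforms` key. The helper name `drop_transform` and the `example_pipeline` below are illustrative and are not copied from the config in this issue.

```python
def drop_transform(pipeline, type_name):
    """Return a copy of `pipeline` without transforms whose 'type' is `type_name`.

    Recurses into nested lists and into 'transforms' keys of wrapper transforms.
    The input pipeline is left unmodified.
    """
    cleaned = []
    for item in pipeline:
        if isinstance(item, (list, tuple)):
            # A branch of a wrapper transform: clean it, drop it if it becomes empty.
            sub = drop_transform(item, type_name)
            if sub:
                cleaned.append(sub)
        elif isinstance(item, dict):
            if item.get('type') == type_name:
                continue
            new_item = dict(item)
            nested = new_item.get('transforms')
            if isinstance(nested, (list, tuple)):
                new_item['transforms'] = drop_transform(nested, type_name)
            cleaned.append(new_item)
        else:
            cleaned.append(item)
    return cleaned


# Illustrative pipeline (an assumption, not the exact config from this issue).
example_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', prob=0.5),
    dict(
        type='RandomChoice',
        transforms=[
            [dict(type='RandomChoiceResize',
                  scales=[(480, 1333), (800, 1333)], keep_ratio=True)],
            [dict(type='RandomChoiceResize',
                  scales=[(400, 1333), (600, 1333)], keep_ratio=True),
             dict(type='RandomCrop', crop_type='absolute_range',
                  crop_size=(384, 600), allow_negative_crop=True),
             dict(type='RandomChoiceResize',
                  scales=[(480, 1333), (800, 1333)], keep_ratio=True)],
        ]),
    dict(type='PackDetInputs'),
]

pipeline_without_crop = drop_transform(example_pipeline, 'RandomCrop')
```

Removing the crop changes the augmentation, so results may differ from the reference config; this only helps confirm whether RandomCrop is involved in the RAM growth.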