deep-high-resolution-net.pytorch
CUDA error: out of memory
I'm running train.py with two 2080 Ti GPUs but got a CUDA "out of memory" error. Each of the two GPUs has 11 GB of memory available. My config file is experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml. The output is:
=> creating output/coco/pose_hrnet/w32_256x192_adam_lr1e-3
=> creating log/coco/pose_hrnet/w32_256x192_adam_lr1e-3_2022-04-23-08-07
Namespace(cfg='experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml', dataDir='', logDir='', modelDir='', opts=[], prevModelDir='')
AUTO_RESUME: True
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  COLOR_RGB: True
  DATASET: coco
  DATA_FORMAT: jpg
  FLIP: True
  HYBRID_JOINTS_TYPE:
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: 0.3
  ROOT: data/coco/
  ROT_FACTOR: 45
  SCALE_FACTOR: 0.35
  SELECT_DATA: False
  TEST_SET: val2017
  TRAIN_SET: train2017
DATA_DIR:
DEBUG:
  DEBUG: True
  SAVE_BATCH_IMAGES_GT: True
  SAVE_BATCH_IMAGES_PRED: True
  SAVE_HEATMAPS_GT: True
  SAVE_HEATMAPS_PRED: True
GPUS: (2, 5)
LOG_DIR: log
LOSS:
  TOPK: 8
  USE_DIFFERENT_JOINTS_WEIGHT: False
  USE_OHKM: False
  USE_TARGET_WEIGHT: True
MODEL:
  EXTRA:
    FINAL_CONV_KERNEL: 1
    PRETRAINED_LAYERS: ['conv1', 'bn1', 'conv2', 'bn2', 'layer1', 'transition1', 'stage2', 'transition2', 'stage3', 'transition3', 'stage4']
    STAGE2:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4]
      NUM_BRANCHES: 2
      NUM_CHANNELS: [32, 64]
      NUM_MODULES: 1
    STAGE3:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4]
      NUM_BRANCHES: 3
      NUM_CHANNELS: [32, 64, 128]
      NUM_MODULES: 4
    STAGE4:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4, 4]
      NUM_BRANCHES: 4
      NUM_CHANNELS: [32, 64, 128, 256]
      NUM_MODULES: 3
  HEATMAP_SIZE: [48, 64]
  IMAGE_SIZE: [192, 256]
  INIT_WEIGHTS: True
  NAME: pose_hrnet
  NUM_JOINTS: 17
  PRETRAINED: models/pytorch/imagenet/hrnet_w32-36af842e.pth
  SIGMA: 2
  TAG_PER_JOINT: True
  TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
  BATCH_SIZE_PER_GPU: 32
  BBOX_THRE: 1.0
  COCO_BBOX_FILE: data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json
  FLIP_TEST: True
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE:
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  POST_PROCESS: True
  SHIFT_HEATMAP: True
  SOFT_NMS: False
  USE_GT_BBOX: True
TRAIN:
  BATCH_SIZE_PER_GPU: 2
  BEGIN_EPOCH: 0
  CHECKPOINT:
  END_EPOCH: 210
  GAMMA1: 0.99
  GAMMA2: 0.0
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP: [170, 200]
  MOMENTUM: 0.9
  NESTEROV: False
  OPTIMIZER: adam
  RESUME: False
  SHUFFLE: True
  WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
Traceback (most recent call last):
  File "tools/train.py", line 223, in
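Before changing anything, it can help to confirm how much memory each visible GPU actually has free, and whether the device indices the config asks for exist at all. A minimal sketch (not part of this repo), assuming a PyTorch version recent enough to have torch.cuda.mem_get_info:

```python
import torch

# Print free/total memory for every CUDA device this process can see.
# The config above requests GPUS: (2, 5), so it is also worth checking
# that device indices 2 and 5 actually exist on this machine.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): "
          f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```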
@yanyixuan2000 did you find a solution? I have the same error with CUDA 11.7 and a CUDA 11.6 build of PyTorch.
Try using a smaller batch size, e.g. BATCH_SIZE_PER_GPU: 16 or 8.
Lower the batch_size a bit, in the yaml file.
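For reference, a sketch of the edit both replies point at, against experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml (the values are suggestions, not verified fixes). Note that the dump above already shows TRAIN.BATCH_SIZE_PER_GPU: 2 but TEST.BATCH_SIZE_PER_GPU: 32 with FLIP_TEST: True, so the validation pass may be where the memory is going:

```yaml
# w32_256x192_adam_lr1e-3.yaml (excerpt; suggested values, not verified)
TRAIN:
  BATCH_SIZE_PER_GPU: 2    # already 2 in the dump above, so unlikely to be the problem
TEST:
  BATCH_SIZE_PER_GPU: 16   # was 32; FLIP_TEST: True adds a second forward pass per batch
```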