Extra memory consumption in training process

Open HuangJunJie2017 opened this issue 4 years ago • 4 comments

[image] In training, there are 8 extra processes that occupy memory on the same GPU. This limits the batch size of the training process.

HuangJunJie2017 avatar Oct 20 '20 02:10 HuangJunJie2017

It seems that you are using dist_train.sh to train the models. These processes are the DataLoader workers.

It is highly recommended to use slurm_train.sh instead of dist_train.sh, even in single-machine training settings. slurm_train.sh uses DistributedDataParallel, which is much more efficient than DataParallel.
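
If these really are DataLoader workers, their number per training process is set by workers_per_gpu in the data config. A minimal mmpose-style fragment for illustration (the values are examples, not taken from this issue):

# Illustrative data config fragment. `workers_per_gpu` determines how many
# DataLoader worker processes each training process spawns; lowering it
# reduces the number of extra processes at the cost of slower data loading.
data = dict(
    samples_per_gpu=32,  # batch size per GPU
    workers_per_gpu=2,   # DataLoader worker processes per GPU
)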

jin-s13 avatar Oct 20 '20 02:10 jin-s13

@jin-s13 ok

HuangJunJie2017 avatar Oct 20 '20 02:10 HuangJunJie2017

https://github.com/open-mmlab/mmpose/blob/master/mmpose/models/backbones/utils/utils.py

map_location='cpu' cannot avoid the extra GPU memory consumption when using

pretrained='https://download.openmmlab.com/mmpose/pretrain_models/hrnet_w32-36af842e.pth',

It only works when I download the pretrained model and use a local path:

pretrained="/mnt/cephfs/algorithm/junjie.huang/models/mmpose/hrnet_w32-36af842e.pth",

emmm, amazing
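
For context, a minimal sketch of the two load paths described above, written directly against PyTorch's torch.load and torch.utils.model_zoo (whether mmcv forwards map_location on the URL path is an assumption inferred from this report, not verified here):

import torch
from torch.utils import model_zoo

url = ('https://download.openmmlab.com/mmpose/pretrain_models/'
       'hrnet_w32-36af842e.pth')
local_path = '/mnt/cephfs/algorithm/junjie.huang/models/mmpose/hrnet_w32-36af842e.pth'

# Local file: map_location='cpu' keeps every tensor on the CPU, so no CUDA
# context (and no extra GPU memory) is created in the calling process.
state_dict_local = torch.load(local_path, map_location='cpu')

# URL: if a loader calls model_zoo.load_url(url) without forwarding
# map_location, tensors that were saved on a GPU are restored onto cuda:0
# in every process, which would match the extra processes seen on one GPU.
# Passing map_location explicitly avoids that:
state_dict_url = model_zoo.load_url(url, map_location='cpu')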

HuangJunJie2017 avatar Nov 22 '20 07:11 HuangJunJie2017

Looks like an mmcv bug.

innerlee avatar Nov 22 '20 09:11 innerlee
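
A possible workaround consistent with the observation above is to download the checkpoint once and point pretrained at the local copy (the destination path below is illustrative):

import torch

url = ('https://download.openmmlab.com/mmpose/pretrain_models/'
       'hrnet_w32-36af842e.pth')
dst = '/path/to/checkpoints/hrnet_w32-36af842e.pth'  # any local path

# Fetch the checkpoint once, outside of training.
torch.hub.download_url_to_file(url, dst)

# Then reference the local file in the model config, e.g.
# pretrained='/path/to/checkpoints/hrnet_w32-36af842e.pth'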