Extra memory consumption in training process
In training, there are 8 extra processes that occupy memory on the same GPU. This limits the batch size of the training process.
It seems that you are using `dist_train.sh` to train the models. These processes are the DataLoader workers. It is highly recommended to use `slurm_train.sh` instead of `dist_train.sh`, even in single-machine training settings. `slurm_train.sh` uses `DistributedDataParallel`, which is much more efficient than `DataParallel`.
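For context, the number of those worker processes comes from the dataloader settings in the config. A minimal sketch of the relevant section of an mmpose config (the values below are illustrative, not taken from this issue):

```python
# Excerpt of the `data` section of an mmpose config (configs are plain Python).
# `workers_per_gpu` controls how many DataLoader worker processes each GPU
# gets; 8 workers would match the 8 extra processes reported above.
data = dict(
    samples_per_gpu=64,  # training batch size per GPU (illustrative value)
    workers_per_gpu=8,   # DataLoader worker processes spawned per GPU
)
```

Note that DataLoader workers normally do CPU-side preprocessing only, so they should not hold GPU memory by themselves; the checkpoint-loading behaviour discussed below is also worth checking.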
@jin-s13 ok
https://github.com/open-mmlab/mmpose/blob/master/mmpose/models/backbones/utils/utils.py
`map_location='cpu'` cannot avoid the extra GPU memory consumption when using
`pretrained='https://download.openmmlab.com/mmpose/pretrain_models/hrnet_w32-36af842e.pth'`.
It works only when I download the pretrained model and use a local path:
`pretrained="/mnt/cephfs/algorithm/junjie.huang/models/mmpose/hrnet_w32-36af842e.pth"`.
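A script-level version of that workaround is to fetch the checkpoint onto local disk once and point `pretrained` at the file. A minimal sketch, assuming a writable destination directory (the path below is illustrative):

```python
import torch

# Fetch the pretrained weights once and keep them on local disk. Pointing the
# config's `pretrained` field at this path takes the local-file loading branch,
# which (per the report above) honours map_location='cpu'.
url = 'https://download.openmmlab.com/mmpose/pretrain_models/hrnet_w32-36af842e.pth'
dst = '/tmp/hrnet_w32-36af842e.pth'  # illustrative destination path
torch.hub.download_url_to_file(url, dst)
```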
emmm, amazing
Looks like an mmcv bug.
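That diagnosis is consistent with how `torch.load` behaves: without an explicit `map_location`, tensors are deserialized back onto the device they were saved from, so a checkpoint saved from GPU tensors lands on GPU 0 in every process. A sketch of the safe pattern (the helper name is hypothetical, not mmcv's actual API):

```python
import torch

def load_url_checkpoint_cpu(url: str) -> dict:
    """Hypothetical helper illustrating the suspected fix (not mmcv's API).

    If a URL loader calls torch.load (directly or via model_zoo) without an
    explicit map_location, tensors saved from CUDA are restored onto GPU 0
    in every process. Forcing map_location='cpu' keeps the state dict on the
    CPU until the model deliberately moves it to the right device.
    """
    return torch.hub.load_state_dict_from_url(url, map_location='cpu')
```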