
[Bug] MMDistributedDataParallel has no effect

Open · doodoo0006 opened this issue 1 year ago • 4 comments

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).

Environment

My environment is:

    conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
    pip install -U openmim -i https://pypi.tuna.tsinghua.edu.cn/simple
    mim install mmengine
    mim install mmcv==2.1.0
    mim install mmdet==3.2.0

I use mmdet3d==1.3.0 to train CenterPoint, and I ran into the problem described below.

When I train with batch=1, gpu=1, the model forward for one iteration costs about 48 ms. When I train with batch=6, gpu=1, the model forward for one iteration costs about 630 ms (6 point-cloud samples inferred in a single pass). I checked the mmengine code: when gpu_size==1 the model stays the original model, with no MMDataParallel-like wrapper on it.

When I train with batch=6, gpu=4, the per-iteration forward time is as below, at most about 800 ms and at least about 270 ms:

    loss time is 0.7451076507568359s cuda is 0
    loss time is 0.7903275489807129s cuda is 2
    loss time is 0.7789270877838135s cuda is 3
    loss time is 0.8002550601959229s cuda is 1
    loss time is 0.7657649517059326s cuda is 3
    loss time is 0.7790787220001221s cuda is 0
    loss time is 0.7731163501739502s cuda is 2
    loss time is 0.9755470752716064s cuda is 1
    loss time is 0.7812612056732178s cuda is 3
    loss time is 0.7439537048339844s cuda is 0
    loss time is 0.7788212299346924s cuda is 2
    loss time is 0.8221783638000488s cuda is 1
    loss time is 0.2801499366760254s cuda is 3
    loss time is 0.2814157009124756s cuda is 0
    loss time is 0.2818455696105957s cuda is 1
    loss time is 0.2732048034667969s cuda is 2

But when I check the same config on mmcv, the time cost is only about 200 ms, so maybe something in my config is not correct.
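
A minimal sketch of how such a per-iteration forward timing can be taken, synchronizing CUDA so the number reflects actual kernel time rather than asynchronous launch time (model and data here are placeholders, not mmdet3d's actual variable names):

    import time
    import torch

    def timed_forward(model, data):
        # Flush queued CUDA work so it does not leak into the measurement.
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            out = model(data)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"forward time: {elapsed_ms:.1f} ms")
        return out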

Reproduces the problem - code sample

My 4-GPU training command is:

    CUDA_VISIBLE_DEVICES=4,5,6,7 tools/dist_train.sh configs/zd_test_speed/net_202312.py 4

My single-GPU command is (with CUDA_VISIBLE_DEVICES=4 set in the environment):

    python train.py configs/zd_test_speed/net_202312.py
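
To double-check whether the model actually gets wrapped, one rough sketch is to build the runner from the same config and print the model type before training. This assumes the config path above and uses mmengine's DDP wrapper class; it is only an illustrative check, not part of the training scripts:

    from mmengine.config import Config
    from mmengine.model import MMDistributedDataParallel
    from mmengine.runner import Runner

    cfg = Config.fromfile('configs/zd_test_speed/net_202312.py')
    runner = Runner.from_cfg(cfg)

    # Expectation: with the default launcher 'none' (single-GPU python train.py),
    # runner.model stays a plain nn.Module; when launched through dist_train.sh
    # it should be wrapped in MMDistributedDataParallel.
    print(type(runner.model))
    print(isinstance(runner.model, MMDistributedDataParallel))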

Reproduces the problem - command or script

The script is the original dist_train.sh from mmdet3d.

Reproduces the problem - error message

NA

Additional information

No response

doodoo0006 avatar Dec 21 '23 14:12 doodoo0006

Addition: when I train with batch=1, gpu=1, the model forward for one iteration costs about 48 ms. I tested this config on mmcv, and the single-sample inference time is also 48 ms. Thank you.

doodoo0006 avatar Dec 21 '23 14:12 doodoo0006

Hi @doodoo0006, did you try using nvidia-smi to check the GPU usage?

zhouzaida avatar Dec 23 '23 04:12 zhouzaida

@zhouzaida Here are the results (full training time and nvidia-smi Memory-Usage; a screenshot is attached for each case):

    batch = 1, gpu = 1: full training time 6 days,      Memory-Usage 8.57 G
    batch = 1, gpu = 4: full training time 1 day 18 h,  Memory-Usage 8.57 G × 4
    batch = 4, gpu = 4: full training time 2 days (not 1 day 18 h as with batch=1, gpu=4), Memory-Usage 22.9 G × 4
    batch = 4, gpu = 1: full training time 6 days,      Memory-Usage 22.5 G × 1
    batch = 6, gpu = 1: full training time 6 days,      Memory-Usage 32.2 G × 1
    batch = 6, gpu = 4: full training time 1 day 23 h,  Memory-Usage 32.2 G × 4

doodoo0006 avatar Dec 25 '23 03:12 doodoo0006

So for now the best config is batch = 1, gpu = 4; batched inference on a single GPU has no effect.
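
A rough back-of-the-envelope check with the forward times reported above points the same way:

    # Forward times reported earlier (single GPU).
    batch1_forward_ms = 48    # batch = 1
    batch6_forward_ms = 630   # batch = 6
    print(batch1_forward_ms / 1)   # 48 ms per sample at batch = 1
    print(batch6_forward_ms / 6)   # ~105 ms per sample at batch = 6, so batching gives no per-sample speedup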

doodoo0006 avatar Dec 25 '23 03:12 doodoo0006