
(ddp) Multi-GPU training error: unused parameters

Open JiahaoPlus opened this issue 9 months ago • 4 comments

Thanks for the nice work!

Currently, I have a problem when I train HMR2 on Slurm with 8 GPUs using: python train.py exp_name=hmr2 data=mix_all experiment=hmr_vit_transformer launcher=slurm trainer=ddp

I got an error:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).

If I set strategy='ddp_find_unused_parameters_true' as the error message suggests, I get this warning instead:

Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

How should I solve this?

Using a single GPU ("trainer=gpu") does not seem to produce this error.
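
For clarity, these are the two options the error message refers to, written against plain PyTorch Lightning (a minimal sketch assuming Lightning 2.x; in this repo the Trainer is built from Hydra configs, so the flag would normally be set there instead):

```python
# Minimal sketch of the two ways to enable unused-parameter detection in DDP
# (plain PyTorch Lightning; not the 4D-Humans Hydra setup).
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Option 1: the string alias from the error message
trainer = pl.Trainer(accelerator="gpu", devices=8,
                     strategy="ddp_find_unused_parameters_true")

# Option 2: the explicit strategy object
trainer = pl.Trainer(accelerator="gpu", devices=8,
                     strategy=DDPStrategy(find_unused_parameters=True))
```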

JiahaoPlus avatar May 11 '24 02:05 JiahaoPlus

I saw the comments in "ddp.yaml":

use "ddp_spawn" instead of "ddp", it's slower but normal "ddp" currently doesn't work ideally with hydra

Has this issue been solved when training 4D-Humans?
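
For reference, the "ddp_spawn" alternative that this config comment refers to is just a different strategy string; a sketch in plain Lightning (in the repo it would presumably be changed in the Hydra trainer config rather than in code):

```python
# Sketch only: "ddp_spawn" starts workers via multiprocessing.spawn, whereas the
# default "ddp" re-launches the training script per process, which is the part the
# ddp.yaml comment says does not play well with Hydra.
import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
```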

JiahaoPlus avatar May 11 '24 02:05 JiahaoPlus

When you set trainer.strategy=ddp_find_unused_parameters_true, do you actually get an error message, or only this warning? If you are only getting a warning, but the code runs, it should be ok.
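
For reference, assuming the Hydra config exposes the trainer strategy in the usual way (an assumption, not verified against the repo), that override would simply be appended to the original training command, e.g.: python train.py exp_name=hmr2 data=mix_all experiment=hmr_vit_transformer launcher=slurm trainer=ddp trainer.strategy=ddp_find_unused_parameters_true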

geopavlakos avatar May 13 '24 23:05 geopavlakos

Thanks for your reply. With "ddp_find_unused_parameters_true" I only get the warning, but I have to make the batch size much smaller to avoid CUDA OOM errors, which makes this setup less practical.

Have you found a setting that can work well on 8 GPUs?
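
As a general note (not specific to this repo), under DDP the effective batch size is the per-GPU batch size times the number of GPUs, so shrinking the per-GPU batch to avoid OOM also shrinks the effective batch; Lightning's standard accumulate_grad_batches argument is one common way to compensate. A minimal sketch under those assumptions:

```python
# Sketch, not from the 4D-Humans code: halve the per-GPU batch size to avoid OOM,
# then accumulate gradients over 2 steps so that
# effective batch = per_gpu_batch * num_gpus * accumulate_grad_batches is unchanged.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp_find_unused_parameters_true",
    accumulate_grad_batches=2,  # standard Lightning Trainer argument
)
```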

JiahaoPlus avatar May 14 '24 02:05 JiahaoPlus

When you say you "need to make the batch size much smaller", what are you comparing it with? On our end, we are currently running the code with this additional strategy flag.

geopavlakos avatar May 14 '24 05:05 geopavlakos