Encountering "Expected to have finished reduction in the prior iteration before starting a new one" Error During Training

LemonWei111 opened this issue 11 months ago • 7 comments

I'm encountering an issue while attempting to finetune a model on my dataset. I run the command as follows: torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_l_coco.yml --use-amp --seed 42 -t deim_dfine_hgnetv2_l_coco_50e.pth

The error message I receive is as follows:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/env1/DEIM/train.py", line 95, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/env1/DEIM/train.py", line 65, in main
[rank0]:     solver.fit()
[rank0]:   File "/home/env1/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank0]:     train_stats = train_one_epoch(
[rank0]:   File "/home/env1/DEIM/engine/solver/det_engine.py", line 58, in train_one_epoch
[rank0]:     outputs = model(samples, targets=targets)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank0]: making sure all `forward` function outputs participate in calculating loss. 
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 384
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
E0214 10:18:44.597429 2431079 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2431105) of binary: /home/anaconda3/envs/env1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-14_10:18:44
  host      : llmserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2431105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I've tried setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL to get more information, but haven't found a clear solution.

I suspect this might be due to the complexity of the DEIM model's forward pass return values, which may prevent DDP from correctly tracking gradient updates for all parameters. If anyone has encountered a similar problem or has any suggestions, your help would be greatly appreciated!
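
For anyone trying to narrow this down, here is a minimal sketch of the kind of check that can identify such parameters (the model, batch, and loss below are placeholders; substitute the real objects from a plain, non-torchrun run of train.py):

```python
import torch
import torch.nn as nn

# Placeholder model and loss: in practice, build the DEIM model, run one real
# batch through it, and compute the real criterion in a single-process run.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss = model(torch.randn(16, 4)).sum()
loss.backward()

# Parameters whose .grad is still None after backward() never contributed to
# the loss; these are exactly the parameters DDP complains about when
# find_unused_parameters is False.
unused = [name for name, p in model.named_parameters()
          if p.requires_grad and p.grad is None]
print("parameters with no gradient:", unused)
```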

LemonWei111 avatar Feb 14 '25 02:02 LemonWei111

Meanwhile, on this custom dataset, the model began converging to a local optimum around the 3rd epoch, and training progress almost stopped.

LemonWei111 avatar Feb 14 '25 03:02 LemonWei111

A specific parameter (decoder.denoising_class_embed.weight) does not receive gradients.
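
As a generic stopgap (a standard PyTorch trick, not something taken from the DEIM code; the toy module below only mimics the situation), such a parameter can be kept in the autograd graph by adding a zero-weighted term to the loss:

```python
import torch
import torch.nn as nn

# Toy stand-in: `unused` plays the role of decoder.denoising_class_embed,
# which the forward pass may skip entirely.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Embedding(80, 4)  # never touched in forward()

    def forward(self, x):
        return self.used(x)

model = Toy()
loss = model(torch.randn(8, 4)).sum()

# Numerically a no-op, but it pulls the otherwise-unused parameter into the
# graph so every rank produces a (zero) gradient for it.
loss = loss + 0.0 * model.unused.weight.sum()
loss.backward()
print(model.unused.weight.grad is not None)  # True
```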

LemonWei111 avatar Feb 14 '25 03:02 LemonWei111

@LemonWei111 Have you solved the issue? I have the same problem now.

mitzy-ByteMe avatar Feb 14 '25 14:02 mitzy-ByteMe

Hi, thank you so much for your interest in our work! In my experience, this kind of issue typically arises because some parameters are not involved in gradient computation. Please confirm two things:

  1. Have you made any modifications to the DEIM model? This includes adding new modules.
  2. Is the number of classes aligned? For example, you may be finetuning on a custom dataset with only 10 categories while keeping the COCO default of 80 classes in the config.

If you’ve confirmed that no changes have been made in these two areas, could you try training on COCO and see if the issue persists? We’ve never encountered this problem with this code. If it still occurs, please provide more details so we can work together to resolve it.
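
If it helps with point 2, here is a quick sketch for checking the category count of a COCO-format annotation file against the num_classes in your config (the annotation path below is a placeholder for your own file):

```python
import json

# Placeholder path: point this at your own COCO-format annotation file.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

cat_ids = sorted(c["id"] for c in coco["categories"])
print(f"{len(cat_ids)} categories, ids {cat_ids[0]}..{cat_ids[-1]}")
# This count should match the num_classes in your dataset/model config;
# training a 10-class dataset against the COCO default of 80 is the
# mismatch described above.
```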

ShihuaHuang95 avatar Mar 04 '25 03:03 ShihuaHuang95

Hi, I also face the same problem. 1) I did not make any changes to the DEIM model, and 2) I can run the training with plain Python (python train.py -c /home/user/DEIM/configs/deim_dfine/dfine_hgnetv2_x_coco.yml --use-amp --seed=0) but not with torchrun (CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c /home/user/DEIM/configs/deim_dfine/dfine_hgnetv2_x_coco.yml --use-amp --seed=0), which suggests the COCO dataset itself is fine, right?

Also, is torchrun necessary if I am running with 1 GPU? Thanks!

dplgcv avatar Mar 18 '25 02:03 dplgcv

https://github.com/ShihuaHuang95/DEIM/blob/c7ed52d0e80e25bf9675d7cca2e35a209fccee87/configs/runtime.yml#L7

Before

find_unused_parameters: False

After

find_unused_parameters: True
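
For context, this YAML option maps onto the standard find_unused_parameters argument of torch.nn.parallel.DistributedDataParallel; a generic sketch of what the flag does (not the actual DEIM wrapping code):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, device_id: int) -> DDP:
    # Requires torch.distributed.init_process_group(...) to have run already,
    # which torchrun plus the training script normally take care of.
    return DDP(
        model.to(device_id),
        device_ids=[device_id],
        find_unused_parameters=True,  # tolerate parameters skipped in forward()
    )
```

Note that enabling it makes DDP traverse the autograd graph every iteration to find the unused parameters, so it costs some speed.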

PINTO0309 avatar Mar 27 '25 15:03 PINTO0309

If one GPU in your DDP setup receives a frame with no ground truth, the denoising part of the model never goes through dn.class_embed() on that rank, so gradient synchronization fails and raises this error.
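
If that is the cause, one option besides find_unused_parameters=True is to drop images with no annotations from the training split; a sketch assuming a COCO-format annotation file (paths are placeholders):

```python
import json

# Placeholder paths: adjust to your own annotation files.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

annotated = {a["image_id"] for a in coco["annotations"]}
before = len(coco["images"])
coco["images"] = [img for img in coco["images"] if img["id"] in annotated]
print(f"dropped {before - len(coco['images'])} images with no ground truth")

with open("annotations/instances_train_nonempty.json", "w") as f:
    json.dump(coco, f)
```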

btroussel avatar Oct 30 '25 11:10 btroussel