DeepLabv3 crashes when run with 1 GPU

Open Landanjs opened this issue 2 years ago • 4 comments

To reproduce

The command below crashes:

composer -n 1 examples/run_composer_trainer.py -f composer/yamls/models/deeplabv3_ade20k_unoptimized.yaml --seed 20 --train_dataset.ade20k.datadir $DATADIR --val_dataset.ade20k.datadir $DATADIR --model.deeplabv3.use_plus false

Traceback:

Traceback (most recent call last):
  File "examples/run_composer_trainer.py", line 60, in <module>
    main()
  File "examples/run_composer_trainer.py", line 56, in main
    trainer.fit()
  File "/root/composer/composer/trainer/trainer.py", line 804, in fit
    self._train_loop()
  File "/root/composer/composer/trainer/trainer.py", line 964, in _train_loop
    total_loss = self._train_batch(microbatches)
  File "/root/composer/composer/trainer/trainer.py", line 1045, in _train_batch
    return self._train_batch_inner(microbatches)
  File "/root/composer/composer/trainer/trainer.py", line 1075, in _train_batch_inner
    self.state.outputs = self.state.model.forward(self.state.batch)
  File "/root/composer/composer/models/deeplabv3/deeplabv3.py", line 150, in forward
    logits = self.model(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/composer/composer/models/deeplabv3/deeplabv3.py", line 25, in forward
    features = self.backbone(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py", line 62, in forward
    x = module(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 762, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 276, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 372, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Additional context

This is likely due to sync batchnorm, but we still need to figure out the appropriate fix.

Landanjs avatar Mar 09 '22 22:03 Landanjs
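For reference, the failure can be reproduced outside of Composer in a few lines: in training mode, SyncBatchNorm's forward queries the default process group for the world size, which raises the same RuntimeError if torch.distributed.init_process_group was never called. A minimal sketch, independent of the DeepLabv3 model (the tiny model here is only for illustration):

```python
import torch
import torch.nn as nn

# Convert BatchNorm layers to SyncBatchNorm, similar to what the sync_bn option does,
# but never call torch.distributed.init_process_group.
model = nn.SyncBatchNorm.convert_sync_batchnorm(
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
)
model.train()

# In training mode SyncBatchNorm asks the default process group for the world size,
# so this forward pass raises:
# RuntimeError: Default process group has not been initialized ...
model(torch.randn(2, 3, 32, 32))
```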

Should we always initialize torch.distributed, even with a world size of 1?

ravi-mosaicml avatar Mar 30 '22 23:03 ravi-mosaicml
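If we go that route, a single-process group is cheap to set up. A rough sketch of the idea, not what the Composer launcher actually does (the gloo backend and rendezvous values are hard-coded purely for illustration):

```python
import os
import torch.distributed as dist

# Illustrative only: stand up a single-process default group so that
# SyncBatchNorm's world-size query succeeds. With world_size == 1,
# SyncBatchNorm skips the sync and falls back to regular batch norm.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```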

I think using composer -n 1 is basically saying to use torch.distributed, so maybe?

hanlint avatar Mar 30 '22 23:03 hanlint

I think there is a DeepLab-specific fix: if the world size is 1, force sync_bn to be false (even if it is set to true in the yaml). Would this be reasonable?

Landanjs avatar Mar 31 '22 00:03 Landanjs
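Something along these lines, perhaps. This is a hypothetical helper, not the actual Composer code; the function name and placement are made up for illustration:

```python
import torch.distributed as dist
import torch.nn as nn

def maybe_convert_sync_batchnorm(model: nn.Module, sync_bn: bool) -> nn.Module:
    """Hypothetical helper: only convert to SyncBatchNorm when more than one
    process is actually participating; otherwise keep plain BatchNorm, even if
    sync_bn is true in the yaml."""
    if sync_bn and dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        return nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return model
```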

I think I've read there are benefits to using torch.distributed even if there is only 1 node. But maybe I'm making that up?

Landanjs avatar Mar 31 '22 00:03 Landanjs

This has been resolved :)

mvpatel2000 avatar Nov 03 '22 03:11 mvpatel2000