DeepLabv3 crashes when run with 1 GPU
To reproduce
The command below crashes:
composer -n 1 examples/run_composer_trainer.py -f composer/yamls/models/deeplabv3_ade20k_unoptimized.yaml --seed 20 --train_dataset.ade20k.datadir $DATADIR --val_dataset.ade20k.datadir $DATADIR --model.deeplabv3.use_plus false
Traceback:
Traceback (most recent call last):
File "examples/run_composer_trainer.py", line 60, in <module>
main()
File "examples/run_composer_trainer.py", line 56, in main
trainer.fit()
File "/root/composer/composer/trainer/trainer.py", line 804, in fit
self._train_loop()
File "/root/composer/composer/trainer/trainer.py", line 964, in _train_loop
total_loss = self._train_batch(microbatches)
File "/root/composer/composer/trainer/trainer.py", line 1045, in _train_batch
return self._train_batch_inner(microbatches)
File "/root/composer/composer/trainer/trainer.py", line 1075, in _train_batch_inner
self.state.outputs = self.state.model.forward(self.state.batch)
File "/root/composer/composer/models/deeplabv3/deeplabv3.py", line 150, in forward
logits = self.model(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/root/composer/composer/models/deeplabv3/deeplabv3.py", line 25, in forward
features = self.backbone(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py", line 62, in forward
x = module(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 762, in get_world_size
return _get_group_size(group)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 276, in _get_group_size
default_pg = _get_default_group()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 372, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Additional context
This is likely due to sync batchnorm, but we need to figure out the appropriate fix.
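For reference, here is a minimal sketch of what the backbone is hitting (assumes a CUDA device, since SyncBatchNorm only accepts GPU inputs): in the PyTorch version from the traceback above, SyncBatchNorm.forward calls torch.distributed.get_world_size() during training, which raises if no default process group has been initialized.

```python
import torch
import torch.nn as nn

# Standalone sketch of the failure mode (assumes a CUDA device is available).
sync_bn = nn.SyncBatchNorm(8).cuda()
sync_bn.train()

x = torch.randn(4, 8, 16, 16, device="cuda")
try:
    sync_bn(x)
except RuntimeError as e:
    print(e)  # "Default process group has not been initialized, ..."
```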
Should we always initialize torch.distributed, even with a world size of 1? I think using composer -n 1 is basically saying "use torch.distributed", so maybe?
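If we go that route, a single-process initialization might look roughly like the sketch below; the helper name, env vars, and backend choice are illustrative, not what Composer currently does.

```python
import os
import torch.distributed as dist

def maybe_init_single_process_group(backend: str = "nccl") -> None:
    """Hypothetical helper: create a world-size-1 default process group
    if torch.distributed is available but not yet initialized."""
    if dist.is_available() and not dist.is_initialized():
        # The env:// rendezvous needs an address/port even for a single process.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group(backend=backend, rank=0, world_size=1)
```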
I think there is a DeepLab-specific fix where, if the world size is 1, we force sync_bn to be false (even if it is true in the yaml). Would this be reasonable?
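A rough sketch of that option, assuming the model builder controls when BatchNorm layers get converted (the helper name and sync_bn flag here are illustrative):

```python
import torch.distributed as dist
import torch.nn as nn

def apply_sync_bn_if_distributed(model: nn.Module, sync_bn: bool) -> nn.Module:
    """Hypothetical helper: only convert to SyncBatchNorm when there is
    actually more than one process, regardless of the yaml setting."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if sync_bn and world_size > 1:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    # With a world size of 1, plain BatchNorm is numerically equivalent,
    # since there are no other processes to sync statistics with.
    return model
```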
I think I've read there are benefits to using torch.distributed even if there is only 1 node. But maybe I'm making that up?
This has been resolved :)