
Custom training in Colab version, stuck on error --> 'LightningDistributedDataParallel' object has no attribute '_sync_params'

Open smithee77 opened this issue 2 years ago • 3 comments

Hi all, first of all thanks for your work and support. I'm trying to deploy a colab version for custom training, but I got stuck here:

`AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'` (full error below)

Any hint??

Thanks again


Lightning config:

```yaml
trainer:
  distributed_backend: ddp
  gpus: 0,
```

```
  | Name            | Type                     | Params
-------------------------------------------------------
0 | encoder         | Encoder                  | 29.3 M
1 | decoder         | Decoder                  | 42.4 M
2 | loss            | VQLPIPSWithDiscriminator | 17.5 M
3 | quantize        | VectorQuantizer2         | 262 K
4 | quant_conv      | Conv2d                   | 65.8 K
5 | post_quant_conv | Conv2d                   | 65.8 K

Validation sanity check: 0it [00:00, ?it/s]
/content/taming-transformers/taming/data/utils.py:137: UserWarning: An output with one or more elements was resized since it had shape [983040], which does not match the required output shape [5, 256, 256, 3]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
  return torch.stack(batch, 0, out=out)
(the same UserWarning is printed a second time)
Summoning checkpoint.
```
```
Traceback (most recent call last):
  File "main.py", line 565, in <module>
    trainer.fit(model, data)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 445, in fit
    results = self.accelerator_backend.train()
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 282, in ddp_train
    results = self.train_or_test()
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 467, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 659, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 171, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 162, in validation_step
    output = self.training_step(args)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
    output = self.trainer.model(*args)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 163, in forward
    self._sync_params()
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    type(self).__name__, name))
AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'
```


smithee77 avatar Apr 20 '22 19:04 smithee77

Commenting out the line `trainer_config["distributed_backend"] = "ddp"` in main.py worked for me.
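A minimal sketch of this workaround (the dict literal below just mirrors the config values quoted in the issue; `trainer_config` is the dict that main.py builds before constructing the Trainer): instead of commenting out the source line, the key can also be dropped programmatically, so Lightning falls back to its default single-process strategy and never wraps the model in `LightningDistributedDataParallel`.

```python
# Hypothetical sketch: remove the "distributed_backend" key from the trainer
# config so the DDP code path (and its call to the removed private method
# _sync_params) is never taken. Values mirror the config quoted above.
trainer_config = {"distributed_backend": "ddp", "gpus": "0,"}
trainer_config.pop("distributed_backend", None)  # same effect as commenting the line
print(trainer_config)  # the "gpus" entry is left untouched
```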

thedarkzeno avatar Jun 01 '22 02:06 thedarkzeno

I had the same problem before; it was solved by strictly following the given environment file. My suggestion is to create a fresh conda environment from environment.yaml.
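For reference, recreating the pinned environment is a one-liner (assuming the env name `taming` defined in the repo's environment.yaml; this is a setup fragment, not something specific to this issue):

```shell
# Build a fresh conda environment from the repo's pinned dependency file,
# then activate it before running main.py again.
conda env create -f environment.yaml
conda activate taming
```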

Crane-YU avatar Jul 14 '22 00:07 Crane-YU

Hi smithee77, have you solved this problem?

ZhangJinian avatar May 21 '24 01:05 ZhangJinian