PyTorch v1.0.0 multi-GPU compatibility issue

Open L0SG opened this issue 5 years ago • 5 comments

Currently, we cannot run the multi-GPU training on PyTorch v1.0.0 due to a strange null gradient issue.

L0SG avatar Dec 21 '18 03:12 L0SG
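
For anyone wanting to check whether a run is affected, here is a minimal sketch of a helper that lists trainable parameters with no gradient after the backward pass. It assumes the "null gradient" shows up as `.grad` being `None`, which the report above does not spell out. The helper name and the parameter names in the demo are hypothetical, and the stand-in objects are there only so the snippet runs without PyTorch.

```python
from types import SimpleNamespace


def params_without_grad(named_params):
    """Return names of trainable parameters whose .grad is still None."""
    missing = []
    for name, param in named_params:
        # Frozen parameters never receive gradients, so skip them.
        if not getattr(param, "requires_grad", True):
            continue
        if param.grad is None:
            missing.append(name)
    return missing


# Stand-in objects (hypothetical names) so the sketch runs without PyTorch.
fake_params = [
    ("front_conv.weight", SimpleNamespace(requires_grad=True, grad=[0.1])),
    ("flows.0.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("frozen.bias", SimpleNamespace(requires_grad=False, grad=None)),
]
print(params_without_grad(fake_params))
```

In real training the same function could be called as `params_without_grad(model.named_parameters())` right after `loss.backward()`; a non-empty list on every step would suggest the optimizer is not updating those weights.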

Oh my God. I have trained on the multi-GPU version for one week with all of my four GPUs. In the params/flowavenet/ dir, only one checkpoint was generated.

Thanks for pointing this out.

candlewill avatar Dec 21 '18 08:12 candlewill

Oops, sorry for posting this issue here so late. I filed a report with the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.

L0SG avatar Dec 21 '18 08:12 L0SG

Update: the issue still persists in the latest 1.0.1 release.

L0SG avatar Feb 12 '19 07:02 L0SG

Note: the DistributedDataParallel implementation from @1ytic circumvents the multi-GPU issue, so please use train_apex.py from the master branch until the DataParallel issue (which affects train.py) is resolved.

L0SG avatar Apr 23 '19 14:04 L0SG

Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.

L0SG avatar Oct 10 '19 04:10 L0SG
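
To summarize the version history in this thread, here is a rough sketch of a guard that flags PyTorch versions where DataParallel training (train.py) may hit the issue. The thread confirms 1.0.0 and 1.0.1 as affected, 0.4.1 as safe, and 1.2.0 as fixed; treating every version from 1.0.0 up to but not including 1.2.0 (for example 1.1.0) as affected is an assumption. Both function names are hypothetical.

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from a string such as '1.0.1.post2'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def data_parallel_affected(version):
    """True if the version falls in the assumed affected range [1.0.0, 1.2.0)."""
    return (1, 0, 0) <= parse_version(version) < (1, 2, 0)


for v in ["0.4.1", "1.0.0", "1.0.1.post2", "1.2.0"]:
    status = "use train_apex.py" if data_parallel_affected(v) else "train.py should be fine"
    print(v, "->", status)
```

In a training script the argument would typically be `torch.__version__`, which is a plain string and may carry suffixes such as `.post2` that the regular expression above ignores.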