
Training script hangs at torch.distributed.init_process_group

Open IstiaqAnsari opened this issue 3 years ago • 0 comments

The training script hangs at this line and does nothing after that.

I looked up this suggestion (https://stackoverflow.com/a/66622440) and tried setting the world size, MASTER_ADDR, and MASTER_PORT beforehand, but that didn't work either:

> Issue 1: It will hang unless you pass in `nprocs=world_size` to `mp.spawn()`. In other words, it's waiting for the "whole world" to show up, process-wise.
>
> Issue 2: The `MASTER_ADDR` and `MASTER_PORT` need to be the same in each process' environment and need to be a free address:port combination on the machine where the process with rank 0 will be run.
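For reference, a minimal sketch of what the Stack Overflow answer describes: each spawned process must see the same `MASTER_ADDR`/`MASTER_PORT`, and `nprocs` must equal the `world_size` passed to `init_process_group`, or the rendezvous never completes. The address, port, and backend below are assumptions for illustration, not values from this repo's training script.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Every process must agree on MASTER_ADDR/MASTER_PORT, and the
    # port must be free on the machine running rank 0 (assumed values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(
        backend="gloo",   # use "nccl" for multi-GPU training
        rank=rank,
        world_size=world_size,
    )
    dist.barrier()        # sanity check: all ranks rendezvous here
    dist.destroy_process_group()

def main():
    world_size = 2
    # If nprocs < world_size, init_process_group waits forever for the
    # missing ranks -- the exact hang described in the answer above.
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
```

If `init_process_group` still hangs with matching `nprocs` and `world_size`, it usually means another process is holding the port or a firewall is blocking the rendezvous.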

I am running an RTX 3090 on a Linux (CentOS) system with CUDA 10.2, PyTorch 1.7.1, Python 3.8, and the apex version mentioned in the README of this repo.

IstiaqAnsari avatar Dec 28 '21 07:12 IstiaqAnsari