character-bert-pretraining
Training script hangs at torch.distributed.init_process_group
The training script hangs at this line and does nothing after that.
I looked up this suggestion and tried setting the world size, master address, and port before that call, but it didn't help either: https://stackoverflow.com/a/66622440

> Issue 1: It will hang unless you pass in `nprocs=world_size` to `mp.spawn()`. In other words, it's waiting for the "whole world" to show up, process-wise.
>
> Issue 2: The `MASTER_ADDR` and `MASTER_PORT` need to be the same in each process' environment and need to be a free address:port combination on the machine where the process with rank 0 will be run.
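For reference, this is roughly the minimal pattern I tried based on that answer (the worker function name and the port are just placeholders, not from the repo's training script):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank, world_size):
    # Every process must see the same MASTER_ADDR/MASTER_PORT, pointing at a
    # free port on the machine hosting rank 0.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(
        backend="nccl",         # "gloo" can be used to rule out NCCL issues
        rank=rank,
        world_size=world_size,  # must match nprocs passed to mp.spawn below
    )
    print(f"rank {rank}/{world_size} initialized")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 1 for a single RTX 3090
    # nprocs must equal world_size, otherwise init_process_group waits forever
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

Even with this setup the repo's training script still hangs for me.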
I am running an RTX 3090 on a Linux (CentOS) system with CUDA 10.2, PyTorch 1.7.1, Python 3.8, and the apex version mentioned in the README of this repo.