
Cannot use with distributed PyTorch

Open ucalyptus2 opened this issue 3 years ago • 2 comments

the id "id name i gave" already exists by one process so rest all workers stop.

ucalyptus2 · Dec 08 '22 00:12

@XJay18

ucalyptus2 · Dec 08 '22 00:12

Hi, if you are using multiple GPUs, you should modify the --nproc_per_node parameter in the training scripts. For example, to train with 2 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml
```
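As a side note, torch.distributed.launch has been deprecated in newer PyTorch releases in favor of torchrun; an equivalent invocation would look like the sketch below, with the caveat that under torchrun the script receives the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so train.py may need to be adapted:

```
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=12345 train.py --config path/to/config.yml
```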

Meanwhile, please make sure the entry id defined in the config YAML file is unique for each experiment. We create a logging folder named ${model_name}/${id}, so if the id is duplicated (and ${model_name} is unchanged), the program cannot create the folder. In that case, either delete the previous logging folder with the same id, or use a new id so a fresh logging folder can be created.
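For illustration, a hypothetical config fragment is shown below; the actual key layout in the repo's YAML may differ, and only `id` and the ${model_name}/${id} folder pattern come from the comment above:

```yaml
# Hypothetical fragment; key names other than `id` are assumptions.
model:
  name: RECCE            # used as ${model_name} in the logging path
id: baseline_run_002     # ${id} -- must be unique per experiment
```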

XJay18 · Dec 08 '22 13:12