
Cannot use with distributed PyTorch

Open ucalyptus2 opened this issue 3 years ago • 2 comments

the id "id name i gave" already exists by one process so rest all workers stop.

ucalyptus2 · Dec 08 '22 00:12

@XJay18

ucalyptus2 · Dec 08 '22 00:12

Hi, if you are using multiple GPUs, you should modify the --nproc_per_node parameter in the training scripts. For example, to train with 2 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml
```
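As a side note, torch.distributed.launch has been deprecated in newer PyTorch releases in favor of torchrun; an equivalent invocation would look like the sketch below, with the caveat that under torchrun the script receives the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so train.py may need to be adapted:

```
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=12345 train.py --config path/to/config.yml
```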

Meanwhile, please make sure the entry id defined in the config YAML file is unique for each experiment. We create a logging folder named ${model_name}/${id}, so if the id is duplicated (and ${model_name} is unchanged), the program cannot create the folder. In that case, either delete the previous logging folder with the same id, or use a new id so a fresh logging folder can be created.
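For illustration, a hypothetical config fragment is shown below; the actual key layout in the repo's YAML may differ, and only `id` and the ${model_name}/${id} folder pattern come from the comment above:

```yaml
# Hypothetical fragment; key names other than `id` are assumptions.
model:
  name: RECCE            # used as ${model_name} in the logging path
id: baseline_run_002     # ${id} -- must be unique per experiment
```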

XJay18 · Dec 08 '22 13:12