Cannot use with distributed PyTorch
The id "id name i gave" already exists because one worker creates it first, so all the remaining workers stop.
@XJay18
Hi,
If you are using multiple GPUs, you should modify the --nproc_per_node parameter in the training scripts. For example, to train with 2 GPUs:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml
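Note that on recent PyTorch releases (1.10+), torch.distributed.launch is deprecated in favor of torchrun; assuming your setup, the equivalent command would be:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml

One caveat: torchrun passes the local rank through the LOCAL_RANK environment variable rather than a --local_rank argument, so check how train.py reads it before switching.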
Meanwhile, please make sure the id defined in the config YAML file is unique for each experiment. We create a logging folder named ${model_name}/${id}, so if the id is duplicated (and ${model_name} is unchanged), the program cannot create the logging folder. In that case, either delete the previous logging folder with the same id, or use a new id for the experiment.
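For what it's worth, here is a minimal sketch of one way to avoid the duplicate-folder race under DDP (the function and variable names are hypothetical, not the repo's actual code): only rank 0 creates ${model_name}/${id}, and the other workers wait at a barrier instead of racing to create the same directory.

import os
import torch.distributed as dist

def setup_logging_dir(model_name: str, exp_id: str) -> str:
    # Assumes dist.init_process_group() has already been called.
    log_dir = os.path.join(model_name, exp_id)
    if dist.get_rank() == 0:
        # Only rank 0 creates the folder; exist_ok=False fails fast
        # if the id was already used for a previous experiment.
        os.makedirs(log_dir, exist_ok=False)
    # All other workers wait here until the folder exists.
    dist.barrier()
    return log_dir

With this pattern, a duplicated id still raises FileExistsError once (on rank 0) instead of every worker crashing independently, which makes the error easier to diagnose.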