How to run multiple distributed training jobs at the same time?
Hi, I'm trying to use distributed training. However, I can't run more than one experiment at the same time because of the 'address already in use' error. What should I change dist_url to so that I can launch multiple distributed training experiments at the same time?
Hi, I met the same problem. Have you resolved this?
Please make yourself familiar with the PyTorch distributed package. In train.py we call the utils.init_distributed_mode method, which initialises distributed mode with init_method=args.dist_url, and args.dist_url in turn is set to env://. This means runs are initialised from environment variables; see Environment variable initialization on this page.
TLDR: Set the MASTER_PORT environment variable to something unique for each run. The 'address already in use' error means two runs tried to bind the same TCP port; when all runs are on one machine, MASTER_ADDR can stay the same and only the port needs to differ.
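Not from the repo, just a minimal, self-contained sketch of what env:// initialisation expects: the four variables below are read by torch.distributed.init_process_group when init_method="env://". The helper names init_distributed and _free_port are made up for illustration; "gloo" is used so the snippet runs on CPU, swap in "nccl" for real multi-GPU training.

```python
import os
import socket

import torch.distributed as dist


def _free_port() -> int:
    # Ask the OS for an unused TCP port so concurrent runs don't collide.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def init_distributed(rank: int = 0, world_size: int = 1, port: int = None):
    # env:// initialisation reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ["MASTER_PORT"] = str(port if port is not None else _free_port())
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    dist.init_process_group(backend="gloo", init_method="env://")


if __name__ == "__main__":
    init_distributed()
    print("initialised rank", dist.get_rank(), "of", dist.get_world_size())
    dist.destroy_process_group()
```

Note that in a real multi-process job every rank must see the same MASTER_PORT, so pick the port once in your launch script (or per-experiment dist_url) rather than per process; for example, export MASTER_PORT=29500 before starting the first experiment and MASTER_PORT=29501 before the second.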