
How to run multiple distributed training jobs at the same time?

Open · XAVILLA opened this issue 4 years ago · 2 comments

Hi, I'm trying to use distributed training, but I'm unable to run more than one experiment at the same time because of the 'address already in use' error. What should I change dist_url to in order to launch multiple distributed training experiments at the same time?

XAVILLA · Jun 23 '21 21:06

Hi, I ran into the same problem. Have you resolved it?

diaozhuo99 · Jun 30 '21 12:06

Please make yourself familiar with the PyTorch distributed package. In train.py we call the utils.init_distributed_mode method, which initialises distributed mode with init_method=args.dist_url, which in turn defaults to env://. This means runs are initialised via environment variables; see the "Environment variable initialization" section of the PyTorch distributed docs.
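For reference, here is a minimal sketch of what env:// initialisation looks like, modelled on the DETR-style utils.init_distributed_mode (exact variable and argument names in the repository may differ):

```python
import os

import torch

def init_distributed_mode(args):
    # torch.distributed.launch / torchrun export these variables
    # for every worker process.
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(args.gpu)
    # With init_method="env://", the rendezvous address is taken
    # from the MASTER_ADDR and MASTER_PORT environment variables.
    torch.distributed.init_process_group(
        backend="nccl",
        init_method=args.dist_url,  # "env://"
        world_size=args.world_size,
        rank=args.rank,
    )
    torch.distributed.barrier()
```

Two concurrent runs that both rendezvous at the same MASTER_ADDR:MASTER_PORT will collide with exactly the 'address already in use' error reported above.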

TLDR: Give each run its own rendezvous address. On a single machine that means setting a unique MASTER_PORT environment variable per run; MASTER_ADDR only needs to differ if the runs rendezvous via different machines.
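For example, assuming the experiments are started through PyTorch's launcher (the flag below belongs to torch.distributed.launch, not to this repository), you can run `python -m torch.distributed.launch --nproc_per_node=2 --master_port=29500 train.py` for the first experiment and the same command with `--master_port=29501` for the second; alternatively, export a different `MASTER_PORT` in each shell before launching.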

timmeinhardt · Jul 05 '21 10:07